Abstract: The beyond worst-case threshold problem (BWC), 
recently introduced by Bruyère et al., asks, given a quantitative 
game graph, for the synthesis of a strategy that i) enforces some 
minimal level of performance against any adversary, and ii) 
achieves a good expectation against a stochastic model of the 
adversary. They solved the BWC problem for finite-memory 
strategies and unidimensional mean-payoff objectives and they 
showed membership of the problem in NP∩coNP. They also 
noted that infinite-memory strategies are more powerful than 
finite-memory ones, but the respective threshold problem was 
left open. 

We extend these results in several directions. First, we consider 
multidimensional mean-payoff objectives. Second, we study both 
finite-memory and infinite-memory strategies. We show that the 
multidimensional BWC problem is coNP-complete in both cases. 
Third, in the special case when the worst-case objective is 
unidimensional (but the expectation objective is still multidimensional) 
we show that the complexity decreases to NP∩coNP. This solves 
the infinite-memory threshold problem left open by Bruyère et 
al., and this complexity cannot be improved without improving 
the currently known complexity of classical mean-payoff games. 
Finally, we introduce a natural relaxation of the BWC problem, 
the beyond almost-sure threshold problem (BAS), which asks for 
the synthesis of a strategy that ensures some minimal level 
of performance with probability one and a good expectation 
against the stochastic model of the adversary. We show that the 
multidimensional BAS threshold problem is solvable in P. 

I. Introduction 

In a two-player mean-payoff game played on a weighted 
graph [1], [2], given a threshold $\nu \in \mathbb{Q}$, we must decide if there 
exists a strategy for Player 1 (the controller) to force plays 
with mean-payoff values larger than $\nu$, against any strategy of 
Player 2 (the environment). In the beyond worst-case threshold 
problem (BWC), recently introduced by Bruyère et al. in [3], 
we are additionally given a stochastic model for the nominal, 
i.e. expected, behaviour of Player 2. Then we are asked, given 
two threshold values $\mu, \nu \in \mathbb{Q}$, to decide if there exists a 
strategy for Player 1 that forces (i) plays with a mean-payoff 
value larger than $\mu$ against any strategy of Player 2, and (ii) an 
expected mean-payoff value larger than $\nu$ when Player 2 plays 
according to the stochastic model of his nominal behaviour. In 
the BWC problem, we thus need to solve simultaneously a 
two-player zero-sum game for the worst-case and an optimization 
problem where the adversary has been replaced by a stochastic 
model of his behaviour. 

BWC is a natural problem; in practice, we want to build 
systems that ensure good performance when the environment 
exhibits his nominal behaviour and that, at the same time, 
ensure some minimal performance no matter how the 
environment behaves. In [3], the BWC problem is solved 
for finite-memory strategies and unidimensional mean-payoff 
objectives, and shown to be in NP∩coNP. Also, it is noted 
there that infinite-memory strategies are more powerful than 
finite-memory ones, and that they cannot even be approximated by 
the latter (already in the unidimensional case; cf. [4, Fig. 6] 
for an example). The respective threshold problem was left 
unsolved. We extend here these results in several directions. 

A. Contributions 

Our contributions are as follows. First, we consider 
$d$-dimensional mean-payoff objectives. Multiple dimensions are 
useful to model systems with multiple objectives that are 
potentially conflicting, and to analyze the possible trade-offs. 
For example, we may want to synthesize strategies that ensure 
a good QoS while keeping the energy consumption as low as 
possible. This extends the BWC problem with one additional 
level of conflicting trade-offs, which makes the analysis 
substantially harder. Second, we study both finite-memory and 
infinite-memory strategies. We show that the multidimensional 
BWC problem is coNP-complete in both cases, and so not 
more expensive than plain multidimensional mean-payoff 
games. This is obtained as a coNP reduction to the solution 
of a linear system of inequalities of polynomial size. 
Correctness follows from non-trivial approximation results 
for finite/infinite-memory strategies inside end-components^1. 
While in the unidimensional case optimal values for the 
expectation can always be achieved precisely (already by 
memoryless strategies), in our multidimensional setting this is no 
longer true. To overcome this difficulty, we show 
that achievable vectors can be approximated with arbitrary 
precision, which is sufficient for our analysis. Third, in the 
special case when the worst-case objective is unidimensional 
(but the expectation is still multidimensional), we show that the 
complexity decreases to NP∩coNP. This solves with optimal 
complexity the infinite-memory threshold problem left open 
in [3]. Finally, we introduce the beyond almost-sure threshold 
problem (BAS) which is a natural relaxation of the BWC 
problem. The BAS problem asks, given two threshold values 
$\vec\mu, \vec\nu \in \mathbb{Q}^d$, for the synthesis of a strategy for Player 1 that (i) 
ensures a mean-payoff larger than $\vec\mu$ almost surely, i.e. with 
probability one, and (ii) an expectation larger than $\vec\nu$ against 
the nominal behaviour of the environment. This problem has 
been independently considered (among other generalizations 
thereof) in [5]. We show that the multidimensional BAS 
threshold problem is solvable in P. As in the BWC problem, 
we reduce to a linear system of inequalities of polynomial size, 
but this time the reduction can be done in P. 

^1 Sub-MDPs which are strongly connected and closed w.r.t. the stochastic 
transitions. 

B. Related works 

Solutions to the expected unidimensional mean-payoff problem 
in Markov Decision Processes (MDPs) can be found, for 
example, in [6]: the problem can be solved in P, and pure memoryless 
strategies are sufficient to play optimally. The threshold problem 
for unidimensional mean-payoff games was first studied 
in [2]: pure memoryless optimal strategies exist for both 
players, and the associated decision problem can be solved 
in NP∩coNP. As said above, BWC was introduced in [3] but 
studied only for finite-memory strategies and unidimensional 
payoffs; the decision problem can be solved in NP∩coNP. 

Multidimensional mean-payoff games are investigated in [7], 
[8], where it is shown that infinite-memory controllers are 
more powerful than finite-memory ones, and the finite-memory 
and general threshold problems are both coNP-complete. The 
expectation problem for multidimensional mean-payoff MDPs 
is in P, and finite-memory controllers always suffice [9]. 
Moreover, a recent study showed that one can add additional 
quantitative probability requirements for the mean-payoff to 
be above a certain threshold (while still optimizing the 
expectation), and that the resulting decision problem is in P for the 
so-called joint interpretation (where the probability threshold is 
the same for all dimensions), and exponential for the 
conjunction interpretation (each dimension has a different probability 
threshold) [5] (cf. also [10]). In both cases, infinite-memory 
strategies are required to achieve the desired performance. 
Here, we study the multidimensional mean-payoff BWC 
threshold problem, for both finite-memory and arbitrary controllers. 
Our BWC threshold problem generalizes both the synthesis 
problem for multidimensional mean-payoff games and for 
multidimensional mean-payoff MDPs with no additional cost 
in worst-case computational complexity. 

C. Illustrating example 

Consider the following task system [11]: There are two 
configurations (0 and 1), and at each interaction between the 
controller and its environment, one new instance of two kinds of 
tasks can be generated (0 and 1). The two tasks are generated 
with equal probability 1/2 in the nominal behavior of the 
environment. Before serving pending task $k \in \{0,1\}$, the 
system may decide to go from configuration $i$ to configuration 
$j$ at cost $a_{ij}$, for $i,j \in \{0,1\}$, and then it has to serve the 
pending task $k$ from the new configuration $j$ at cost $b_{jk}$. Thus, 
the total cost is $a_{ij} + b_{jk}$. Costs are bidimensional: Each cost 
specifies an amount of time and energy; the actual parameters 
are shown in Fig. 1a. For example, from configuration 0, 
task 0 takes 30 time units to complete and it consumes 2 
energy units, while from the other configuration the same task 
takes 2 time units and 10 energy units. We are interested in 
synthesizing controllers that optimize the expected/worst-case 
mean (i.e., per task) time and energy. There are trade-offs 
between the two measures: If the controller decides to serve 
a task quickly then the system consumes a large amount of 
energy, and vice versa. To analyze this example, we rephrase it 
as the multidimensional mean-payoff MDP depicted in Fig. 1b. 
For example, state 0 represents the fact that the system is in 
configuration 0 waiting for a task to arrive, while in state (0, 0) 
a task of the first type has arrived, and the controller needs 
to decide whether to serve it from the same configuration, or 
go to configuration 1. The objective of the controller is to 
guarantee worst-case mean time 24 under all circumstances; 
in that case the probabilities in the MDP are ignored and the 
probabilistic choice is replaced by an adversarial choice (we 
thus have a two-player zero-sum game). Additionally, with 
the same strategy for the controller, we want to minimize the 
expected mean energy consumption under the nominal behaviour 
of the environment given by the stochastic model. If the controller 
decides to always serve tasks from configuration 0, then it 
ensures an expected mean energy consumption of 3, but under 
this strategy, the worst-case mean time is 60, which does 
not meet our worst-case objective of 24. A strategy for the 
controller that is good both for the worst-case and for the 
expectation can be obtained as follows: For two parameters 
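$\alpha, \beta \in \mathbb{N}$, stay in configuration 0 for $\alpha$ consecutive tasks, then 
move to configuration 1 for $\beta$ tasks, and then repeat. This 
ensures worst-case time 
$\frac{\alpha-1}{\alpha+\beta} \cdot 60 + \frac{1}{\alpha+\beta} \cdot 64 + \frac{\beta-1}{\alpha+\beta} \cdot 8 + \frac{1}{\alpha+\beta} \cdot 16$ 
and expected energy 
$\frac{\alpha-1}{\alpha+\beta} \left(\frac{1}{2} 2 + \frac{1}{2} 4\right) + \frac{1}{\alpha+\beta} \left(\frac{1}{2} 4 + \frac{1}{2} 6\right) + \frac{\beta-1}{\alpha+\beta} \left(\frac{1}{2} 10 + \frac{1}{2} 20\right) + \frac{1}{\alpha+\beta} \left(\frac{1}{2} 16 + \frac{1}{2} 26\right)$. 
By taking $\alpha = 1$ and $\beta = 3$, we 
obtain worst-case time 24 (thus meeting the requirement) and 
expected energy 14. Note the trade-off: To ensure a stronger 
guarantee on the mean time, we had to sacrifice the expected 
mean energy. 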
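
As a sanity check on the numbers of this example, the two expressions for the alternating strategy can be evaluated mechanically. The following is only an illustrative sketch: the per-task costs are the coefficients of those expressions (the last time coefficient is taken to be 16, the value consistent with the stated result of 24), and the names `phase_fractions`, `worst_case_time` and `expected_energy` are hypothetical helpers that do not come from the paper.

```python
from fractions import Fraction

# Worst-case time per task in each phase of one alpha+beta cycle.
# Assumption: values are the coefficients of the example's expression;
# "enter" marks the first task served right after a configuration switch.
WORST_TIME = {"stay0": 60, "enter0": 64, "stay1": 8, "enter1": 16}

# Expected energy per task in each phase (both task types have probability 1/2).
EXPECTED_ENERGY = {
    "stay0": Fraction(2 + 4, 2),
    "enter0": Fraction(4 + 6, 2),
    "stay1": Fraction(10 + 20, 2),
    "enter1": Fraction(16 + 26, 2),
}


def phase_fractions(alpha, beta):
    """Fraction of tasks served in each phase of one alpha+beta cycle."""
    period = alpha + beta
    return {
        "stay0": Fraction(alpha - 1, period),
        "enter0": Fraction(1, period),
        "stay1": Fraction(beta - 1, period),
        "enter1": Fraction(1, period),
    }


def worst_case_time(alpha, beta):
    frac = phase_fractions(alpha, beta)
    return sum(frac[p] * WORST_TIME[p] for p in frac)


def expected_energy(alpha, beta):
    frac = phase_fractions(alpha, beta)
    return sum(frac[p] * EXPECTED_ENERGY[p] for p in frac)


print(worst_case_time(1, 3), expected_energy(1, 3))  # prints "24 14"
```

Under the same assumptions, a larger $\beta$ lowers the worst-case time further while raising the expected energy, which is the trade-off noted in the text.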

In this paper we address the problem of deciding the 
existence of controllers ensuring a worst-case (or almost- 
sure) threshold, while, at the same time, achieving a usually 
better expectation threshold under the nominal behavior of the 
environment. We consider the class of multidimensional mean- 
payoff objectives. 

D. Structure of the paper 

In Sec. II, we present the preliminaries that are necessary to 
define the BWC and BAS problems. In Sec. III, we solve the 
BWC problem both for finite- and infinite-memory strategies. 
In Sec. IV, we solve the BAS problem and show that 
finite-memory strategies are sufficient for the BAS threshold 
problem. Finally, in Sec. V we conclude with some final 
remarks. Full proofs can be found in Appendices A and 
B. 

II. Preliminaries 

Let $\mathbb{N}$, $\mathbb{Q}$, and $\mathbb{R}$ be the sets of natural, rational, and real 
numbers, respectively, and let $\mathbb{R}_{\pm\infty} = \mathbb{R} \cup \{+\infty, -\infty\}$. For 
two vectors $\vec\mu$ and $\vec\nu$ of the same dimension and a comparison 
operator $\sim \in \{\le, <, \ge, >\}$, we write $\vec\mu \sim \vec\nu$ for the 
component-wise application of $\sim$. In particular, $\vec\mu > \vec 0$ means that every 
component of $\vec\mu$ is strictly positive. A probability distribution 
on $A$ is a function $R : A \to \mathbb{Q}^{\ge 0}$ s.t. $\sum_{a \in A} R(a) = 1$. The 
support of $R$ is $\mathrm{Supp}(R) = \{a \in A \mid R(a) > 0\}$. Let $\mathcal{D}(A)$ 
be the set of probability distributions on $A$. 

[Fig. 1: Illustrating example. (a) Consumption of time (below) and energy (above). (b) Example of multidimensional mean-payoff MDP.]



A. Weighted graphs 

A multi-weighted graph is a tuple $G = (d, S, E, w)$, where 
$d \ge 1$ is the dimension, $S$ is a finite set of states, $E \subseteq S \times S$ 
is the set of directed edges, and $w : E \to \mathbb{Z}^d$ is a function 
assigning to each edge a weight vector. When $d = 1$, we refer 
to $G$ just as a weighted graph. With $\vec x[i]$ we denote the 
$i$-th component of a vector $\vec x$. For a state $s \in S$, let 
$E(s) = \{t \mid (s,t) \in E\}$ be its set of successors. We assume that each 
state $s$ has at least one successor. Let $W$ be the largest absolute 
value of a weight appearing in the graph. 

A play in $G$ is an infinite sequence of states $\pi = s_0 s_1 \cdots$ 
s.t. $(s_i, s_{i+1}) \in E$ for every $i \ge 0$. Let $\Omega^\omega_{s_0}(G)$ be the set 
of plays in $G$ starting at $s_0$, and let $\Omega^\omega(G)$ be the set of all 
plays of $G$. When $G$ is clear from the context, we omit it. 
The prefix of length $n$ of a play $\pi = s_0 s_1 \cdots$ is the finite 
sequence $\pi(n) = s_0 s_1 \cdots s_{n-1}$. For a set of states $T \subseteq S$, 
let $\Omega^*(G, T)$ be the set of prefixes of plays in $G$ ending in a 
state $s_{n-1} \in T$. 

The total payoff and mean payoff up to length $n$ of a play 
$\pi = s_0 s_1 \cdots$ (or prefix of length at least $n$) are defined 
as $\mathrm{TP}_n(\pi) = \sum_{i=0}^{n-1} w(s_i, s_{i+1})$ and $\mathrm{MP}_n(\pi) = \frac{1}{n} \mathrm{TP}_n(\pi)$, 
respectively. The (lim-inf) total and mean payoffs on an infinite 
play $\pi$ are then defined as $\mathrm{TP}(\pi) := \liminf_{n \to \infty} \mathrm{TP}_n(\pi)$ and 
$\mathrm{MP}(\pi) := \liminf_{n \to \infty} \mathrm{MP}_n(\pi)$. 
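
The finite quantities $\mathrm{TP}_n$ and $\mathrm{MP}_n$ are easy to compute on a concrete prefix. The sketch below is illustrative only: the two-state graph, its weights, and the function names are hypothetical, and the code assumes that summing $n$ edge weights requires the first $n+1$ states of the play.

```python
from fractions import Fraction


def total_payoff(w, states, n):
    """TP_n: component-wise sum of the weights of the first n edges."""
    dim = len(next(iter(w.values())))
    total = [0] * dim
    for i in range(n):  # reads states[0..n], i.e. n + 1 states
        edge_weight = w[(states[i], states[i + 1])]
        total = [acc + x for acc, x in zip(total, edge_weight)]
    return tuple(total)


def mean_payoff(w, states, n):
    """MP_n = TP_n / n, computed exactly with rationals."""
    return tuple(Fraction(x, n) for x in total_payoff(w, states, n))


# Hypothetical 2-dimensional weighted graph on the states "s" and "t".
w = {("s", "s"): (0, 1), ("s", "t"): (0, 0), ("t", "t"): (1, 0), ("t", "s"): (0, 0)}
prefix = ["s", "s", "t", "t", "s"]
print(mean_payoff(w, prefix, 4))
```

The lim-inf values on infinite plays cannot be obtained this way in general; for an eventually periodic play, however, $\mathrm{MP}$ coincides with the mean weight of the repeated cycle.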


B. Markov decision processes 

A Markov decision process, or MDP, is a tuple $\mathcal{G} = (G, S^C, S^R, R)$, 
where $G = (d, S, E, w)$ is a multi-weighted graph, $\{S^C, S^R\}$ is a 
partition of $S$ into states belonging to either the Controller player 
or to the Random player, respectively, and $R : S^R \to \mathcal{D}(S)$ is a 
function assigning a distribution over $S$ to states belonging to 
Random s.t., for every $s \in S^R$, $\mathrm{Supp}(R(s)) = E(s)$. We do not allow $R(s)$ to 
assign probability zero to any successor of $s$.^2 Let $Q$ be the 
largest denominator used to represent probabilities in $R$. We 
use $Q$ as a measure of complexity for representing $R$. 

In order to discuss the complexity of strategies for 
Controller, we represent them as stochastic Moore machines. A 
strategy for an MDP $\mathcal{G} = (G, S^C, S^R, R)$ is a tuple 
$f = (M, \alpha, f_u, f_o)$, consisting of a set of memory states $M$, the 
initial memory distribution $\alpha \in \mathcal{D}(M)$, the stochastic memory 
update function $f_u : S \times M \to \mathcal{D}(M)$, and the stochastic 
output function $f_o : S^C \times M \to \mathcal{D}(S)$, where 
$\mathrm{Supp}(f_o(s, m)) \subseteq E(s)$ for every $s \in S^C$ and $m \in M$. The update function is 
extended to sequences $f_u^* : S^* \to \mathcal{D}(M)$ inductively as $f_u^*(\varepsilon) = \alpha$ 
and $f_u^*(\pi s)(m') = \sum_{m \in M} f_u^*(\pi)(m) \cdot f_u(s, m)(m')$. The 
output function on sequences $f_o^* : S^* S^C \to \mathcal{D}(S)$ is 
defined as $f_o^*(\pi s)(s') = \sum_{m \in M} f_u^*(\pi)(m) \cdot f_o(s, m)(s')$. A 
play $\pi = s_0 s_1 \cdots$ is consistent with a Controller’s strategy 
$f$ if, and only if, for every $i$ s.t. $s_i \in S^C$, we have 
$s_{i+1} \in \mathrm{Supp}(f_o^*(s_0 s_1 \cdots s_i))$. Given a state $s_0$ and a Controller’s 
strategy $f$, the set of outcomes $\Omega^{\omega,f}_{s_0}(G)$ is the set of plays 
starting at $s_0$ which are consistent with $f$. 
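
The inductive extension of the update and output functions to sequences can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' construction: distributions are plain dictionaries, the class name `MooreStrategy` and its methods are hypothetical, and the example strategy (which fixes at the start, with equal probability, whether to always move to `s` or to `t`) is made up for the demonstration.

```python
from fractions import Fraction


class MooreStrategy:
    """Stochastic Moore machine (M, alpha, f_u, f_o) with dict-based tables."""

    def __init__(self, alpha, f_u, f_o):
        self.alpha = alpha  # memory -> probability
        self.f_u = f_u      # (state, memory) -> {memory': probability}
        self.f_o = f_o      # (controller state, memory) -> {state': probability}

    def memory_after(self, history):
        """f_u^*(history): memory distribution after reading `history`."""
        dist = dict(self.alpha)
        for s in history:
            nxt = {}
            for m, p in dist.items():
                for m2, q in self.f_u[(s, m)].items():
                    nxt[m2] = nxt.get(m2, 0) + p * q
            dist = nxt
        return dist

    def next_state(self, history):
        """f_o^*(history): successor distribution; history ends in a controller state."""
        *past, s = history
        out = {}
        for m, p in self.memory_after(past).items():
            for t, q in self.f_o[(s, m)].items():
                out[t] = out.get(t, 0) + p * q
        return out


half = Fraction(1, 2)
# Hypothetical 2-memory strategy: memory "S" always moves to s, memory "T" to t;
# the memory state never changes after the initial random choice.
f = MooreStrategy(
    alpha={"S": half, "T": half},
    f_u={(x, m): {m: 1} for x in "st" for m in "ST"},
    f_o={(x, m): {m.lower(): 1} for x in "st" for m in "ST"},
)
print(f.next_state(["s"]))
```

A pure strategy corresponds to tables whose inner dictionaries all have a single entry, and a memoryless one to a single memory state.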

A strategy $f$ is pure iff $\mathrm{Supp}(f_u(s, m))$ and $\mathrm{Supp}(f_o(s, m))$ 
are both singletons. A strategy $f$ is memoryless iff $|M| = 1$, 
finite-memory iff $|M| < \infty$, and infinite-memory iff $|M| = \infty$. 
Let $\Delta(\mathcal{G})$, $\Delta_P(\mathcal{G})$, $\Delta_F(\mathcal{G})$, and $\Delta_{PF}(\mathcal{G})$ be the sets of all, resp., 
pure, finite-memory, and pure finite-memory strategies. 

^2 This restriction will simplify the presentation. It is not present in [3], but 
it can be easily lifted. 

C. Markov chains 

A Markov chain is an MDP where no state belongs to 
Controller, i.e., $S^C = \emptyset$, and in this case we just write 
$\mathcal{G} = (G, R)$. An event is a measurable set of plays $A \subseteq \Omega^\omega(G)$. 
Given a state $s_0$ and an event $A \subseteq \Omega^\omega(G)$, let $\mathbb{P}^{\mathcal{G}}_{s_0}[A]$ be the 
probability that a play starting in $s_0$ belongs to $A$, which exists 
and is unique by Carathéodory’s extension theorem [12]. An 
event is almost sure if it has probability 1. For a measurable 
payoff function $v : \Omega^\omega(G) \to \mathbb{R}_{\pm\infty}$, let $\mathbb{E}^{\mathcal{G}}_{s_0}[v]$ be the expected 
value of $v$ of a play starting in $s_0$. 

A Markov chain $\mathcal{G}$ is unichain if it contains exactly one 
bottom strongly connected component (BSCC). Therefore, if 
$\mathcal{G}$ is unichain, then all states in its unique BSCC are visited 
infinitely often almost surely, and the mean payoff equals its 
expected value almost surely. 

Given an MDP $\mathcal{G} = (G, S^C, S^R, R)$ and a strategy 
$f$ for Controller represented as the stochastic Moore 
machine $(M, \alpha, f_u, f_o)$, let the induced Markov chain 
be $\mathcal{G}[f] = (G', R')$, where $G' = (d, S \times M, E', w')$ 
with $((s, m), (s', m')) \in E'$ iff $(s, s') \in E$, 
$m' \in \mathrm{Supp}(f_u(s, m))$, and $s' \in \mathrm{Supp}(f_o(s, m))$ whenever $s \in S^C$; 
$w'((s, m), (s', m')) = w(s, s')$ for every $((s, m), (s', m')) \in E'$; 
$R'(s, m)(s', m') = R(s)(s') \cdot f_u(s, m)(m')$ for every 
$s \in S^R$, and $R'(s, m)(s', m') = f_o(s, m)(s') \cdot f_u(s, m)(m')$ 
for every $s \in S^C$. Note that $\mathcal{G}[f]$ is finite iff $f$ is 
finite-memory. By a slight abuse of terminology, we say that a 
strategy $f$ is unichain if $\mathcal{G}[f]$ is unichain. Plays in $\mathcal{G}[f]$ can 
be mapped to plays in $\mathcal{G}$ by a projection operator 
$\mathrm{proj}(\cdot) : \Omega^\omega(G') \to \Omega^\omega(G)$ which discards the memory of $f$. Given 
a state $s_0$, a Controller’s strategy $f$, and an event $A \subseteq \Omega^\omega(G)$, 
let $\mathbb{P}^{\mathcal{G}}_{s_0,f}[A] := \mathbb{P}^{\mathcal{G}[f]}_{s_0}[\mathrm{proj}^{-1}(A)]$. For a measurable payoff 
function $v : \Omega^\omega(G) \to \mathbb{R}_{\pm\infty}$, let $\mathbb{E}^{\mathcal{G}}_{s_0,f}[v] := \mathbb{E}^{\mathcal{G}[f]}_{s_0}[v']$, where 
$v'(\pi) := v(\mathrm{proj}(\pi))$. 
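
The product construction behind the induced Markov chain can be sketched for finite-memory strategies as follows. The code is a hedged illustration only: the function name `induced_chain`, the dictionary encoding of $R$, $f_u$ and $f_o$, and the tiny two-state MDP are assumptions made for the example.

```python
from fractions import Fraction


def induced_chain(controller_states, R, f_u, f_o):
    """Transition probabilities R' of the induced Markov chain on S x M.

    R:   random state s            -> {s': probability}
    f_u: (state s, memory m)       -> {m': probability}
    f_o: (controller state s, m)   -> {s': probability}
    """
    chain = {}
    for (s, m), memory_dist in f_u.items():
        # Controller states move according to f_o, random states according to R.
        successor_dist = f_o[(s, m)] if s in controller_states else R[s]
        chain[(s, m)] = {
            (s2, m2): p * q
            for s2, p in successor_dist.items()
            for m2, q in memory_dist.items()
            if p * q > 0
        }
    return chain


half = Fraction(1, 2)
# Hypothetical MDP: controller state "c", random state "r" (uniform over c and r).
R = {"r": {"c": half, "r": half}}
# Memoryless strategy (single memory state 0) that always moves from c to r.
f_u = {("c", 0): {0: 1}, ("r", 0): {0: 1}}
f_o = {("c", 0): {"r": 1}}
chain = induced_chain({"c"}, R, f_u, f_o)
```

Each row of the resulting table sums to one, and the number of rows is $|S| \cdot |M|$, which is finite exactly when the strategy is finite-memory.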

D. End-components 

An end-component (EC) of an MDP $\mathcal{G}$ is a set of states 
$U \subseteq S$ s.t. a) the induced sub-graph $(U, E \cap U \times U)$ is 
strongly connected, and b) for any stochastic state $s \in U \cap S^R$, 
$E(s) \subseteq U$. Thus, Controller can surely keep the game inside 
an EC, and almost surely visits all states therein. For an 
end-component $U$ of $\mathcal{G}$, we denote by $\mathcal{G} \restriction U$ the MDP obtained 
by restricting $\mathcal{G}$ to $U$ in the natural way. ECs are central in 
the analysis of MDPs thanks to the following result. 
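
Conditions a) and b) can be checked directly on a candidate set of states. The sketch below is an illustration under assumptions: the successor-list encoding, the function name `is_end_component`, and the example graph are hypothetical, and the code additionally requires every state of $U$ to have a successor inside $U$ so that plays can actually remain there.

```python
def is_end_component(U, succ, random_states):
    """Check conditions a) and b) for a candidate set of states U.

    succ maps every state to the list of its successors.
    """
    U = set(U)
    if not U:
        return False
    # b) stochastic states cannot leave U.
    if any(not set(succ[s]) <= U for s in U & set(random_states)):
        return False
    # Edges of the induced sub-graph, forwards and backwards.
    fwd = {s: [t for t in succ[s] if t in U] for s in U}
    bwd = {s: [] for s in U}
    for s, targets in fwd.items():
        for t in targets:
            bwd[t].append(s)
    # Assumption of this sketch: every state needs a successor inside U.
    if any(not fwd[s] for s in U):
        return False

    def reach(start, edges):
        seen, stack = {start}, [start]
        while stack:
            for t in edges[stack.pop()]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    # a) strongly connected: one state reaches, and is reached by, all of U.
    root = next(iter(U))
    return reach(root, fwd) == U and reach(root, bwd) == U


# Hypothetical MDP: "r" is the only random state.
succ = {"s": ["s", "t"], "t": ["t", "s"], "r": ["s", "x"], "x": ["x"]}
print(is_end_component({"s", "t"}, succ, {"r"}))
```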

Proposition 1 (cf. [13]). For any Controller’s strategy 
$f \in \Delta(\mathcal{G})$, the set of states visited infinitely often when playing 
according to $f$ is almost surely an EC. 

E. Expected-value objective 

For an MDP $\mathcal{G}$, a starting state $s_0$, and a Controller’s strategy 
$f \in \Delta(\mathcal{G})$, the set of expected-value achievable solutions 
for $f$ is $\mathrm{ExpSol}_{\mathcal{G}}(s_0, f) = \{\vec\nu \in \mathbb{R}^d \mid \mathbb{E}^{\mathcal{G}}_{s_0,f}[\mathrm{MP}] > \vec\nu\}$, i.e., 
it is the set of vectors $\vec\nu$ s.t. Controller can guarantee an 
expected mean payoff $> \vec\nu$ from state $s_0$ by playing $f$. The 
set of expected-value achievable solutions is 
$\mathrm{ExpSol}_{\mathcal{G}}(s_0) = \bigcup_{f \in \Delta(\mathcal{G})} \mathrm{ExpSol}_{\mathcal{G}}(s_0, f)$. Given a state $s_0$ and a rational 
threshold vector $\vec\nu \in \mathbb{Q}^d$, the expected-value threshold problem asks 
whether $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$. 

[Fig. 2: Approximating the expectation inside ECs. (a) An MDP reduced to one EC. (b) Exact strategy inducing two BSCCs. (c) Approximate finite-memory strategy inducing one BSCC.]

Theorem 1 ([9]). The expected-value threshold problem for 
multidimensional mean-payoff MDPs is in P. 

While randomized finite-memory strategies are both necessary 
and sufficient in general for achieving a given expected mean 
payoff, in ECs we can use randomized finite-memory unichain 
strategies to approximate achievable vectors. Being unichain 
ensures that the mean payoff equals the expectation almost 
surely. By standard convergence results in Markov chains, this 
entails that by playing such a strategy for sufficiently long time 
we obtain an average mean payoff close to the expectation 
with high probability. We crucially use this property in the 
constructions leading to the main results of Sec. III and IV; 
cf. Lemmas 3, 5, and 8. 

Example 1. We illustrate the idea in the single end-component 
MDP in Fig. 2a (cf. [8, Fig. 3]). There exists a simple 
randomized 2-memory strategy $f$ achieving expected mean 
payoff precisely $(\frac{1}{2}, \frac{1}{2})$ which decides, with equal probability, 
whether to stay forever in $s$ or in $t$. However, the induced 
Markov chain has two BSCCs; cf. Fig. 2b. While intuitively no 
pure finite-memory strategy can achieve mean payoff exactly 
equal to $(\frac{1}{2}, \frac{1}{2})$ in this example, finite-memory unichain strategies 
can approximate this value. For a parameter $A \in \mathbb{N}$, consider 
the strategy $g_A$ which stays in $s$ for $A$ steps, and then 
goes to $t$, stays in $t$ for $A$ steps, and then goes back to $s$, 
and repeats this scheme forever. The induced Markov chain 
has only one BSCC, thus $g_A$ is unichain; cf. Fig. 2c. The 
strategy $g_A$ achieves expected (and worst-case) mean payoff 
$(\frac{A}{2A+2}, \frac{A}{2A+2})$, which converges from below to $(\frac{1}{2}, \frac{1}{2})$ as 
$A \to \infty$. 
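
The convergence claimed in Example 1 can be checked numerically. Since the play produced by the alternating strategy is periodic, its mean payoff equals the mean weight over one period. The sketch below is illustrative and rests on an assumption about the figure: self-loops carry weights $(0,1)$ and $(1,0)$, and the two switching edges carry $(0,0)$; the function name `cycle_mean_payoff` is hypothetical.

```python
from fractions import Fraction


def cycle_mean_payoff(A):
    """Mean payoff of one period: A loops in s, switch, A loops in t, switch."""
    # Assumed weights: (0, 1) while looping in s, (1, 0) while looping in t,
    # and (0, 0) on the two switching edges.
    period = [(0, 1)] * A + [(0, 0)] + [(1, 0)] * A + [(0, 0)]
    n = len(period)  # 2A + 2 edges per period
    return tuple(Fraction(sum(w[i] for w in period), n) for i in range(2))


for A in (1, 10, 100):
    print(A, cycle_mean_payoff(A))
```

Both components equal $A/(2A+2)$, which stays strictly below $\frac{1}{2}$ for every finite $A$ and approaches it as $A$ grows.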

Lemma 1. Let $\mathcal{G}$ be a multidimensional mean-payoff MDP, let 
$s_0$ be a state in an EC $U$ thereof, and let $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G} \restriction U}(s_0)$ 
be an expectation vector achievable by remaining inside $U$. 
There exists a finite-memory unichain strategy $g \in \Delta_F(U)$ 
achieving the same expectation $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G} \restriction U}(s_0, g)$. 

Remark 1. In the lemma above, we can take $g$ to be even 
a pure finite-memory unichain strategy. This can be obtained 
by a de-randomization technique at the cost of introducing 
extra memory of size exponential in the number of the states 
controlled by the player. However, we do not need this stronger 
result in the rest of the paper, and we content ourselves with 
randomized strategies for simplicity. 

Proof sketch. By the results of [9], there exists a randomized 
finite-memory strategy $f$ achieving expected mean payoff 
$\vec\nu^* > \vec\nu$ which surely stays inside $U$. However, $(\mathcal{G} \restriction U)[f]$ 
is not unichain in general. By Proposition 1, the set of states 
visited infinitely often by a play in $(\mathcal{G} \restriction U)[f]$ is an EC almost 
surely. Since there are finitely many different ECs, there are 
probabilities $\alpha_1, \dots, \alpha_n > 0$ and ECs $U_1, \dots, U_n \subseteq U$ s.t. the 
set of states visited infinitely often by a play in $(\mathcal{G} \restriction U)[f]$ 
is $U_1$ with probability $\alpha_1$, ..., $U_n$ with probability $\alpha_n$. By 
Proposition 1, $\alpha_1 + \cdots + \alpha_n = 1$. In the first step, we define 
a “local” randomized memoryless strategy $g_i$ which plays as 
$f$ once inside $U_i$. No approximation is introduced in this 
step. In the second step, we combine the local randomized 
memoryless strategies $g_i$ above. We build a randomized 
finite-memory strategy $g$ which cycles between $U_1, \dots, U_n$ 
and plays according to $g_i$ inside each $U_i$ a fraction $\approx \alpha_i$ of the 
time. This is possible since $U_i, U_j$ are almost surely mutually 
inter-reachable due to the fact that we are always inside the 
EC $U$. By construction, $(\mathcal{G} \restriction U)[g]$ is unichain since $g$ cycles 
between all the ECs $U_1, \dots, U_n$. Moreover, for every $\varepsilon > 0$, 
we can make the expected fraction of time spent changing 
component smaller than $\varepsilon$. Thus, $g$ achieves expected mean 
payoff at least $(1 - \varepsilon) \cdot \vec\nu^* - (W, \dots, W) \cdot \varepsilon$, where $W$ is the 
largest absolute value of any weight in $\mathcal{G}$. The latter quantity 
can be made $> \vec\nu$ for sufficiently small $\varepsilon > 0$. □ 
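
The final step of the sketch can be made concrete. Component-wise, $(1-\varepsilon)\nu^*_i - \varepsilon W > \nu_i$ is equivalent to $\varepsilon(\nu^*_i + W) < \nu^*_i - \nu_i$, so any $\varepsilon$ below the smallest ratio $(\nu^*_i - \nu_i)/(\nu^*_i + W)$ works. The code below is only an illustration of this arithmetic; the function names `safe_epsilon` and `lower_bound` and the sample vectors are hypothetical.

```python
from fractions import Fraction


def safe_epsilon(nu_star, nu, W):
    """Return some eps in (0, 1) with (1 - eps) * nu_star - eps * W > nu component-wise."""
    bounds = [Fraction(1)]
    for a, b in zip(nu_star, nu):
        if a <= b:
            raise ValueError("requires nu_star > nu in every component")
        if a + W > 0:
            # (1 - eps) * a - eps * W > b  <=>  eps < (a - b) / (a + W)
            bounds.append(Fraction(a - b) / Fraction(a + W))
        # If a + W <= 0, the inequality holds for every eps >= 0.
    return min(bounds) / 2  # halving keeps the inequality strict


def lower_bound(nu_star, W, eps):
    """The guaranteed payoff (1 - eps) * nu_star - (W, ..., W) * eps."""
    return tuple((1 - eps) * a - eps * W for a in nu_star)


nu_star = (Fraction(1, 2), Fraction(1, 2))
nu = (Fraction(2, 5), Fraction(2, 5))
eps = safe_epsilon(nu_star, nu, 1)
print(eps, lower_bound(nu_star, 1, eps))
```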

F. Worst-case objective 

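For an MDP $\mathcal{G}$, a starting state $s_0$, and a Controller’s 
strategy $f \in \Delta(\mathcal{G})$, the set of worst-case achievable 
solutions for $f$ is defined as 
$\mathrm{WCSol}_{\mathcal{G}}(s_0, f) = \{\vec\mu \in \mathbb{R}^d \mid \forall \pi \in \Omega^{\omega,f}_{s_0}(G) \cdot \mathrm{MP}(\pi) > \vec\mu\}$, i.e., it is the set of vectors 
$\vec\mu$ s.t. Controller can surely guarantee a mean payoff $> \vec\mu$ 
from state $s_0$ by playing $f$. The set of worst-case achievable 
solutions is $\mathrm{WCSol}_{\mathcal{G}}(s_0) = \bigcup_{f \in \Delta(\mathcal{G})} \mathrm{WCSol}_{\mathcal{G}}(s_0, f)$. Given a 
state $s_0$ and a rational threshold vector $\vec\mu \in \mathbb{Q}^d$, the worst-case 
threshold problem asks whether $\vec\mu \in \mathrm{WCSol}_{\mathcal{G}}(s_0)$. 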

With this worst-case interpretation, the randomized choices 
in the MDP are replaced by purely adversarial ones, and the 
MDP can thus be viewed as a two-player zero-sum game. 
While infinite-memory strategies are more powerful than 
finite-memory ones for the worst-case objective, the latter suffice to 
approximate achievable vectors. We make extensive use of this 
property in Sec. III-A where we restrict our attention to 
finite-memory strategies. 

Lemma 2 (cf. Lemma 15 of [8]). Let $\mathcal{G}$ be a multidimensional 
mean-payoff MDP, $s_0$ a state therein, and let $\vec\mu \in \mathrm{WCSol}_{\mathcal{G}}(s_0)$. 
There exists a pure finite-memory strategy $f \in \Delta_{PF}(\mathcal{G})$ for 
Controller s.t. $\vec\mu \in \mathrm{WCSol}_{\mathcal{G}}(s_0, f)$. 

The finite-memory strategy threshold problem for 
multidimensional mean-payoff games is coNP-complete [7], [8]. By 
the lemma above, finite-memory controllers suffice in our 
setting, and we obtain the following complexity characterization. 

Theorem 2 ([7], [8]). The worst-case threshold problem for 
multidimensional mean-payoff MDPs is coNP-complete. 

In the unidimensional case, memoryless strategies suffice for 
both players [1], [2], and the complexity is NP∩coNP (and 
even UP∩coUP [11], [14]). Whether this problem is in P is a 
long-standing open question. 

Theorem 3 ([1], [2]). The worst-case threshold problem for 
unidimensional mean-payoff MDPs is in NP∩coNP. 

III. Beyond worst-case synthesis 

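We generalize [3] to the multidimensional setting. Given an 
MDP $\mathcal{G}$, a starting state $s_0$, and a Controller’s strategy $f$, the 
set of beyond worst-case achievable solutions for $f$, denoted 
$\mathrm{BWCSol}_{\mathcal{G}}(s_0, f)$, is the set of pairs of vectors $(\vec\mu; \vec\nu) \in \mathbb{R}^{2d}$ 
s.t. $f$ surely guarantees a worst-case mean payoff $> \vec\mu$, and 
achieves an expected mean payoff $> \vec\nu$ starting from $s_0$: 

$$\mathrm{BWCSol}_{\mathcal{G}}(s_0, f) = \left\{ (\vec\mu; \vec\nu) \in \mathbb{R}^{2d} \;\middle|\; \vec\mu \in \mathrm{WCSol}_{\mathcal{G}}(s_0, f) \text{ and } \vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, f) \right\}.$$

Let $\mathrm{BWCSol}_{\mathcal{G}}(s_0) = \bigcup_{f \in \Delta(\mathcal{G})} \mathrm{BWCSol}_{\mathcal{G}}(s_0, f)$ be the set of 
beyond worst-case achievable solutions. Given a starting state 
$s_0$ and a pair of threshold vectors $(\vec\mu; \vec\nu) \in \mathbb{R}^{2d}$, the beyond 
worst-case threshold problem (BWC) asks whether 
$(\vec\mu; \vec\nu) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0)$. 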

Remark 2. We assume w.l.o.g. that $\vec\mu = \vec 0$. This follows 
by shifting each component by an appropriate amount. We 
further assume w.l.o.g. that $\vec\nu > \vec 0$. This follows from the fact 
that, since the mean payoff is surely $> \vec 0$ by the worst-case 
objective, then also the expectation is $> \vec 0$. 
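
The shift mentioned in the remark can be sketched as follows. Subtracting $\vec\mu$ from every edge weight lowers every mean payoff, and hence every expectation, by exactly $\vec\mu$, so thresholds $(\vec\mu; \vec\nu)$ on the original weights correspond to $(\vec 0; \vec\nu - \vec\mu)$ on the shifted ones. The helper name `shift_thresholds` and the sample data are hypothetical; rescaling the shifted weights back to integers, if needed, is omitted here.

```python
from fractions import Fraction


def shift_thresholds(w, mu, nu):
    """Shift all weights by -mu; returns (shifted weights, new expectation threshold)."""
    shifted = {
        edge: tuple(Fraction(x) - Fraction(m) for x, m in zip(weight, mu))
        for edge, weight in w.items()
    }
    new_nu = tuple(Fraction(n) - Fraction(m) for n, m in zip(nu, mu))
    return shifted, new_nu


# Hypothetical single-edge graph with worst-case threshold (1, 2)
# and expectation threshold (2, 4).
w = {("s", "s"): (3, 5)}
shifted, new_nu = shift_thresholds(w, (1, 2), (2, 4))
print(shifted, new_nu)
```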

Remark 3. We say that $\mathcal{G}$ is pruned if $\vec 0 \in \mathrm{BWCSol}_{\mathcal{G}}(s)$ 
for every state $s$ therein. Controller cannot satisfy the BWC 
objective if she ever visits a state $s$ not satisfying the 
worst-case objective. Many of our results are thus stated under the 
condition that $\mathcal{G}$ is pruned. However, pruning an MDP, i.e., 
removing those states which are losing w.r.t. the worst-case 
objective, requires solving a mean-payoff game, and this will 
have a crucial impact on the complexity. 

[Fig. 3: Running example.]

The finite-memory threshold problem for the unidimensional 
beyond worst-case problem has been studied in [3]. 

Theorem 4 ([3]). The finite-memory threshold problem for the 
unidimensional beyond worst-case problem for mean-payoff 
objectives is in NP∩coNP. 

A. Finite-memory synthesis 

In this section, we address the problem of deciding whether 
there exists a finite-memory strategy for the BWC problem 
in the multidimensional setting. By Proposition 1, we know 
that the set of states visited infinitely often by any strategy 
(not necessarily a finite-memory one) is almost surely an 
EC. The crucial observation is that, when restricted to 
finite-memory strategies, the same holds for ECs of a special kind. An EC $U$ is 
winning (WEC) iff Controller can surely guarantee the 
worst-case threshold $> \vec 0$ when constrained to remain in $U$, starting 
from any state therein. Whether an EC is winning depends on 
the worst-case objective alone. 

The following proposition is central in the analysis of the 
BWC problem for finite-memory strategies; cf. [3, Lemma 4] 
in the unidimensional case. 

Proposition 2. Let f be a finite-memory strategy satisfying 
the worst-case threshold problem. The set of states visited 
infinitely often under f is almost surely a winning EC. 

Running example. As a simple example that will be used 
throughout the rest of the paper, consider the MDP in Fig. 3. 
There are only two ECs $U$ and $V$, of which $U$ is winning, but 
$V$ is not. Indeed, from $v$ the adversary can always select the 
lower edge with payoff $(30, -60)$. In $U$ we can achieve 
expectation $(5, 15)$, and in $V$ we can achieve expectation $(15, 5)$. 
Therefore, according to the proposition above, any finite-memory 
strategy satisfying the worst-case objective will eventually go 
to $U$ almost surely. 

We proceed by analyzing WECs separately in Sec. III-A1, 
and then we tackle general MDPs in Sec. III-A2. This will 
yield our complexity result in Sec. III-A3. 


1) Inside a WEC: We show that inside WECs finite- 
memory strategies always suffice for the BWC objective. 
In particular, the threshold problem in WECs immediately 
reduces to an expectation threshold problem. 

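Lemma 3. Let $\mathcal{G}$ be a pruned multidimensional mean-payoff 
MDP, let $s_0$ be a state in a WEC $W$ of $\mathcal{G}$, and let 
$\vec\nu \in \mathrm{ExpSol}_{\mathcal{G} \restriction W}(s_0)$ with $\vec\nu > \vec 0$ be an expectation achievable by 
remaining inside $W$. There exists a randomized finite-memory 
strategy $h \in \Delta_F(\mathcal{G})$ s.t. $(\vec 0; \vec\nu) \in \mathrm{BWCSol}_{\mathcal{G} \restriction W}(s_0, h)$ that also 
remains inside $W$. 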

Remark 4. The statement of the lemma holds even with h a 
pure finite-memory strategy, by applying Remark 1 when con¬ 
structing the expectation strategy which is part of h. However, 
randomized strategies suffice for our purposes. 

We use finite-memory strategies defined in WECs (such as 
$h$ above) when constructing a global BWC strategy in the 
analysis of arbitrary MDPs in Sec. III-A2. The construction of 
$h$ is done in a way analogous to the proof of Theorem 5 in [3]; 
cf. App. B for the details. However, the analysis in the 
multidimensional case is considerably more difficult than in previous 
work. It crucially relies on Lemma 1 for the extraction of 
finite-memory unichain strategies approximating the expectation 
objective inside ECs. Note that in the unidimensional case of 
[3] optimal expectation values can be reached exactly already 
by pure memoryless unichain strategies (no approximation 
needed). This is a key technical difference between our 
multidimensional setting and the unidimensional one of [3]. 

2) General case: We reduce the finite-memory BWC 
problem to the solution of a system of linear inequalities. This 
is similar to the solution of the multidimensional expectation 
problem presented in [9]. When only the expectation is 
considered, the intuition is that a “global expectation” is obtained 
by combining together “local expectations” achieved in ECs. 
Thus, a strategy for the expectation works in two phases: 

Phase I: Reach ECs with appropriate probabilities. 

Phase II: Once inside an EC, switch to a local expectation 
strategy to achieve the right “local expectation”. 

In the BWC problem, we need to enforce two extra 
conditions: First, only “local expectations” from winning ECs 
should be considered (by Proposition 2 finite-memory 
controllers cannot stay in a non-WEC forever with non-zero 
probability). Second, “local expectations” should be $> \vec 0$ in 
order to satisfy the worst-case objective (a negative “local 
expectation” would violate the worst-case objective). 
Accordingly, a strategy for the BWC problem behaves as follows: 

Phase I: Reach WECs with appropriate probabilities. 

Phase II: Once inside a WEC, switch to a local BWC strategy 
to achieve the right “local expectation” $> \vec 0$. 

We write a system of linear inequalities expressing this 
two-phase decomposition. W.l.o.g. we assume that state $s_0$ belongs 
to Controller, and that all WECs are reachable with positive 
probability from $s_0$ (unreachable states can be removed). 
Consider the system $T$ in Fig. 4. For each state $s \in S$ we 
have a variable $y_s$, and for each edge $(s, t) \in E$ we have 
variables $x_{st}$ and $y_{st}$. System $T$ can be divided into three 
parts. The first part consists of Equations (A1)-(A2). Variable 
$y_s$ represents the probability that, upon visiting state $s$, we 
switch to Phase II. Variables $y_{st}$ are used to express flow 
conditions. In Eq. (A1) we put an initial flow of 1 in $s_0$, and 
we require that the total incoming flow to a state equals the 
outgoing flow (including the leak $y_s$). Eq. (A1') ensures that 
the outgoing flow $y_{st}$ through an edge from a stochastic state 
$s$ is a fixed fraction of the incoming flow. Finally, Eq. (A2) 
states that we switch to Phase II in a WEC almost surely. 

[Fig. 4: Linear system $T$ for the BWC finite-memory threshold problem.]

Before explaining the other two parts of $T$, we need to
introduce maximal WECs. A maximal WEC (MWEC) is a
WEC which is not strictly included in another WEC. The
restriction to MWECs is crucial for complexity. The second
part of $T$ consists of Eq. (B), which provides a link between
Phase I and Phase II. Variable $x_{st}$ represents the long-run
frequency of edge $(s,t)$. Eq. (B) links the transient behaviour
before switching inside a certain MWEC and the steady-state
behaviour once inside it. More precisely, it guarantees that the
probability to switch inside a certain MWEC equals the total
long-run frequency of all edges in the MWEC.

Finally, the remaining equations make up the third part of
$T$. Eq. (C1) is a flow condition for the $x_{st}$'s, stating that the
incoming flow to a state equals the outgoing flow. Eq. (C1')
forces the flow to respect the probabilities of stochastic states.
Eq. (C2) guarantees that the expected mean payoff is $> \vec{\nu}$, as
required. Eq. (C3) needs some justification. It is specific to our
setting and it does not follow from [9]. This equation specifies
that the expected mean payoff is $> \vec{0}$ inside every MWEC. We
need to ensure that only “local” expected mean payoffs $> \vec{0}$
are considered in WECs, in order to be able to apply
the results from Sec. III-A1. Eq. (C3) imposes
a seemingly strong constraint by requiring that all WECs are
visited infinitely often with positive probability. Ideally, we
would like to guess which MWECs need to be
visited infinitely often with positive probability, but this would
not yield a good complexity, since there are exponentially
many different sets of MWECs. Instead, we require that
every MWEC is visited infinitely often with some positive
probability. Since we are only interested in approximating the
expectation, it is always possible to put an arbitrarily small total
probability on MWECs that do not contribute to the “global”
mean payoff. This is formalized below.

Proposition 3. Let $\mathcal{G}$ be a pruned multidimensional mean-payoff
MDP. If there exists a finite-memory strategy $h$ s.t.
$(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, h)$, then there exists a finite-memory
strategy $h^*$ with the same property, and such that, for every
MWEC $U$, the set of states visited infinitely often by $h^*$ is a
subset of $U$ with positive probability.

Proof. Since by assumption all MWECs are reachable with
positive probability from $s_0$, for every MWEC $U$ there exists
a strategy $f_U$ reaching $U$ with positive probability from $s_0$.
Moreover, since $U$ is a WEC, there exists a strategy $f^{wc}_U$ for
the worst-case objective $> \vec{0}$ that surely remains in $U$. Let $f^{wc}$
be a worst-case strategy winning everywhere (it exists since $\mathcal{G}$
is pruned by assumption). We construct the following strategy
$f_N$ parametrized by a natural number $N > 0$:

• Choose a MWEC $U$ uniformly at random.

• Play $f_U$ for $N$ steps.

- If after $N$ steps the play is in $U$, then switch to $f^{wc}_U$.

- Otherwise, switch to $f^{wc}$.

By construction, $f_N$ is winning for the worst-case for every
$N > 0$. Moreover, it is easy to see that there exists an
$N^*$ sufficiently large s.t., for every MWEC $U$, $f_{N^*}$ visits $U$
infinitely often with positive probability.

Finally, the strategy $h^*$ plays with probability $p > 0$ according
to $f_{N^*}$, and otherwise according to $h$. Since both $f_{N^*}$
and $h$ are winning for the worst-case, so is $h^*$. The expected
mean payoff of $h^*$ converges from below to the expected mean
payoff of $h$ as $p$ tends to 0. Therefore, there exists
$p > 0$ s.t. $(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, h^*)$. □
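
The last step of the proof can be illustrated numerically. The sketch below assumes that the expected mean payoff of the mixed strategy is the convex combination of the expected payoffs of the two strategies it randomizes over (the initial coin toss is independent of the play). The function names and the sample vectors are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
from fractions import Fraction

def mixture_expectation(p, exp_f, exp_h):
    """Expected mean payoff when playing f with probability p and h otherwise."""
    return tuple(p * a + (1 - p) * b for a, b in zip(exp_f, exp_h))

def small_enough_p(exp_f, exp_h, nu):
    """Halve p until the mixture strictly exceeds nu in every component.

    Assumes exp_h > nu componentwise, so the loop terminates.
    """
    p = Fraction(1, 2)
    while not all(m > v for m, v in zip(mixture_expectation(p, exp_f, exp_h), nu)):
        p /= 2
    return p

# Hypothetical numbers: h achieves (6, 6) > nu = (5, 5), while f_N only achieves (0, 0).
p = small_enough_p((0, 0), (6, 6), (5, 5))
print(p, mixture_expectation(p, (0, 0), (6, 6)))
```

Any smaller positive $p$ works as well, which is why the proof only needs the strict inequality for $h$.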

We now state the correctness of the reduction. 

Lemma 4. Let $\mathcal{G}$ be a pruned multidimensional mean-payoff
MDP, let $s_0 \in S$, and let $\vec{\nu} > \vec{0}$. There exists a finite-memory
strategy $h$ s.t. $(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, h)$ if, and only if, the
system $T$ has a non-negative solution.

The rest of this section is devoted to the proof of the lemma
above. Both directions are non-trivial. For the right-to-left
direction, we need to explain which kind of strategies can
be extracted from a non-negative solution of $T$. The following
proposition shows that from a non-negative solution of $T$ we can
extract a strategy for the expectation combining only “local
mean payoffs” $> \vec{0}$ and visiting infinitely often each MWEC
with positive probability.

Proposition 4. If $T$ has a non-negative solution, then there
exists a finite-memory strategy $h$ s.t. $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, h)$, and

1) For every MWEC $U$, there is a probability $y^*_U > 0$ s.t.
the set of states visited infinitely often by $h$ is a subset of
$U$ with probability $y^*_U$.

2) Once $h$ reaches the MWEC $U$, it achieves expected mean
payoff $\vec{\nu}_U > \vec{0}$.

3) $\sum_{\text{MWEC } U} y^*_U \cdot \vec{\nu}_U > \vec{\nu}$.

Proof. Let $\{y^*_s\}_{s \in S}$, $\{y^*_{st}\}_{(s,t) \in E}$, and $\{x^*_{st}\}_{(s,t) \in E}$ be a
non-negative solution to $T$. Proposition 4.2 of [9] essentially shows
how to construct from the solution above a finite-memory
strategy $h$ s.t. $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, h)$.
For a MWEC $U$, let

$$y^*_U = \sum_{s \in U} y^*_s \qquad (1)$$

By Eq. (C3), for every MWEC $U$ there exist $s,t \in U$ s.t.
$x^*_{st} > 0$. Together with Eq. (B), this implies that $y^*_U > 0$,
which proves Point 1.

For a MWEC $U$, let

$$\vec{\nu}_U = \frac{1}{y^*_U} \sum_{(s,t) \in E \cap U \times U} x^*_{st} \cdot w(s,t) \qquad (2)$$

and notice that $\vec{\nu}_U$ is the expected mean payoff of $h$ once
inside $U$. By Eq. (C3), $\vec{\nu}_U > \vec{0}$, which proves Point 2.

Eq. (B) implies that $h$ eventually stays forever inside a WEC
almost surely. Consequently, $\sum_{\text{MWEC } U} y^*_U = 1$. Since states
visited infinitely often with probability zero do not contribute
to the expected mean payoff, it suffices to look at MWECs.
By the prefix independence of the mean payoff value function,
and since MWEC $U$ is reached with probability $y^*_U$, strategy
$h$ achieves expected mean payoff $\sum_{\text{MWEC } U} y^*_U \cdot \vec{\nu}_U$. By Point
1), the latter quantity is $> \vec{\nu}$. □

We are now ready to prove Lemma 4. 

Proof of Lemma 4. For the left-to-right direction, assume
that $h$ is a finite-memory strategy guaranteeing
$(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, h)$. Proposition 4.4 of [9] essentially shows that
any strategy satisfying the expectation objective $> \vec{\nu}$ induces
a solution to $T$ satisfying Equations (A1)-(C2), except that
Eq. (B) should be interpreted over MECs (instead of MWECs).
(This follows from the fact that the set of states visited
infinitely often by any strategy is an EC almost surely; cf.
Proposition 1.) However, since $\vec{0} \in \mathrm{WCSol}_{\mathcal{G}}(s_0, h)$ and $h$ is
finite-memory, we can apply Proposition 2 and deduce that $h$
visits infinitely often a winning EC almost surely. Thus Eq. (B)
is satisfied even over MWECs.

It remains to address Eq. (C3). By Proposition 3, there exists
a strategy $h^*$ s.t., for every MWEC $U$, $h^*$ eventually stays
forever in $U$ with a positive probability. This implies that,
when constructing a solution to $T$ induced by $h^*$ (as above),
for every MWEC $U$ there are $s,t \in U$ with $x_{st} > 0$. Moreover, since
$h^*$ is winning for the worst-case, it achieves an expected mean
payoff $> \vec{0}$ in $U$, and thus Eq. (C3) is satisfied.

For the right-to-left direction, assume that $T$ has a non-negative
solution. Let $h$ be the strategy in $\mathcal{G}$ given by Proposition 4.
For every MWEC $U$, let $y^*_U$ and $\vec{\nu}_U$ be as given in the
statement of the proposition. While $h$ alone is not sufficient to
show $(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0)$ since it does not satisfy the worst-case
objective in general, we show how to construct from
it another finite-memory strategy ensuring the BWC
objective. The latter strategy is obtained by combining together
the following strategies:

• Let $h^{wc}$ be a finite-memory strategy in $\mathcal{G}$ ensuring the
worst-case mean payoff $\vec{0} \in \mathrm{WCSol}_{\mathcal{G}}(t, h^{wc})$ from every
state $t$ in $\mathcal{G}$. This is possible since $\mathcal{G}$ is pruned.

• For each MWEC $U$, let $h_U$ be a finite-memory strategy
s.t. $(\vec{0};\vec{\nu}_U) \in \mathrm{BWCSol}_{\mathcal{G}}(t, h_U)$ for every state $t \in U$.
This strategy can be obtained as follows. Let $\mathcal{G} \upharpoonright U$
be the game $\mathcal{G}$ restricted to the EC $U$. By Point 2 of
Proposition 4, $\vec{\nu}_U \in \mathrm{ExpSol}_{\mathcal{G} \upharpoonright U}(t_0, h_U)$ for some state
$t_0 \in U$. Since $U$ is an EC, $\vec{\nu}_U \in \mathrm{ExpSol}_{\mathcal{G}}(t, h_U)$ for
every state $t \in U$. Since $\vec{\nu}_U > \vec{0}$, we can apply Lemma 3
for every $t \in U$, and obtain a strategy $h_t$ s.t.
$(\vec{0};\vec{\nu}_U) \in \mathrm{BWCSol}_{\mathcal{G} \upharpoonright U}(t, h_t)$. Let $h_U$ be the finite-memory strategy
in $\mathcal{G} \upharpoonright U$ that plays according to $h_t$ when starting from
state $t$. Clearly, $(\vec{0};\vec{\nu}_U) \in \mathrm{BWCSol}_{\mathcal{G} \upharpoonright U}(t, h_U)$.
Consider the strategy $h^{bwc}_N$ parameterized by a natural number
$N > 0$ which is defined as follows:

1) Play according to $h$ for $N$ steps.

2) After $N$ steps:

2a) If we are inside the MWEC $U$, then switch to $h_U$.
2b) Otherwise, play according to $h^{wc}$.

We argue that $h^{bwc}_N$ satisfies the beyond worst-case objective
$(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, h^{bwc}_N)$ for $N$ large enough. For every
$N$, $h^{bwc}_N$ clearly satisfies the worst-case objective, since after
$N$ steps it switches to a strategy that satisfies it by construction
(by prefix-independence of the mean payoff objective).
We now consider the expectation objective. By Point 1 of
Proposition 4, the set of states visited infinitely often by
$h$ is a subset of the MWEC $U$ with probability $y^*_U$. By
taking $N$ large enough, we can guarantee being inside $U$ with
probability arbitrarily close to $y^*_U$. By construction, $h_U$ can
be chosen to achieve expected mean payoff arbitrarily close
to $\vec{\nu}_U$. Since $h^{bwc}_N$ switches to $h_U$ with probability arbitrarily
close to $y^*_U$, $h^{bwc}_N$ achieves expected mean payoff arbitrarily
close to $\sum_{\text{MWEC } U} y^*_U \cdot \vec{\nu}_U$. By Point 3 of Proposition 4, the
latter quantity is $> \vec{\nu}$. There exists $N^*$ large enough s.t.
$h^{bwc}_{N^*}$ achieves expected mean payoff $> \vec{\nu}$. Take $h^{bwc} = h^{bwc}_{N^*}$. As
required, $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, h^{bwc})$. □

Running example. Since U = {f} is a MWEC, while 
V = {m, u} is not, finite memory strategies must go to U . 
Therefore, with finite memory we can ensure BWC threshold 
((0,0); (0, 9)), but not ((0,0); (9,9)) for example. 

3) Complexity: We obtain the following complexity characterization
for the threshold problem with finite-memory
controllers.

Theorem 5. The finite-memory multidimensional mean-payoff
BWC threshold problem is coNP-complete.

Proof. Pruning states where the worst-case objective cannot
be satisfied requires solving multidimensional mean-payoff
games, which can be done in coNP by Theorem 2. It has
been already shown in [3] how the decomposition in MWECs
can be performed in P with an oracle for solving mean-payoff
games. Thus, the MWEC decomposition can be performed in
coNP. System $T$ has size polynomial in $\mathcal{G}$ (there are only
polynomially many MWECs) and it can thus be produced in coNP.
By Lemma 4, it suffices to solve system $T$, which can be done
in polynomial time by linear programming. The lower bound
follows directly from the fact that the multidimensional BWC
threshold problem contains the worst-case as a subproblem;
the latter is coNP-hard as recalled in Theorem 2. □

The complexity of the BWC problem is dominated by the
worst-case subproblem. We obtain an improved complexity
by restricting the worst-case to be essentially unidimensional.
Formally, we say that a BWC threshold $(\vec{\mu};\vec{\nu}) \in \mathbb{R}^{2d}$ has
trivial worst-case component $i$, with $1 \le i \le d$, iff $\vec{\mu}[i] = -W$,
where $W$ is the maximal absolute value of any weight in $\mathcal{G}$.
We say that $(\vec{\mu};\vec{\nu})$ is essentially worst-case unidimensional iff
it has at most one non-trivial worst-case component. We can
ignore trivial components when solving a worst-case threshold
problem. Thus, the worst-case problem for essentially
unidimensional thresholds reduces to a simple unidimensional
worst-case problem. As recalled in Theorem 3, the latter can
be solved in NP ∩ coNP, thus yielding the following improved
complexity for the BWC problem.

Corollary 1. The finite-memory multidimensional mean-payoff
BWC threshold problem w.r.t. essentially worst-case unidimensional
thresholds is in NP ∩ coNP.


Since the unidimensional BWC problem, i.e., where all
weights are unidimensional, is in NP ∩ coNP (cf. Theorem 4),
this result shows that we can add a multidimensional
expectation objective to a unidimensional worst-case obligation
without an extra price in complexity. In particular, we can
model complex situations like the task system presented in
Sec. I-C, where the worst-case and expectation mean payoffs
are along independent dimensions.

B. Infinite-memory synthesis 

Already in the unidimensional case, infinite-memory strategies
are more powerful than finite-memory ones (cf. [4, Fig. 6]).
This is a consequence of the fact that finite-memory
strategies for the BWC objective ultimately remain inside
WECs almost surely (cf. Proposition 2). On the other hand,
infinite-memory strategies can benefit from payoffs achievable
inside arbitrary ECs. In this section, we address the problem
of deciding whether there exists a general strategy, i.e., not
necessarily a finite-memory one, for the multidimensional BWC
problem. This was left as an open problem, already in the
unidimensional case [3]. As in the previous section, we first
analyze ECs, and then general MDPs.

1) Inside an EC: The lemma below is a direct generalization
of Lemma 2 to arbitrary ECs. While for WECs we could
construct finite-memory strategies, we now construct
infinite-memory strategies for arbitrary ECs.

Lemma 5. Let $\mathcal{G}$ be a pruned multidimensional mean-payoff
MDP, let $s_0$ be a state in an EC $U$ of $\mathcal{G}$, and let $\vec{\nu} > \vec{0}$ be
an expectation vector $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G} \upharpoonright U}(s_0)$ which is achievable
while remaining in $U$. There exists a strategy $f \in \Delta(\mathcal{G})$ (not
necessarily remaining in $U$) s.t. $(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, f)$.

Remark 5. The statement of the lemma holds even with $f$
a pure strategy, by applying Remark 1 when constructing
the expectation strategy $f^{exp}$ below. However, randomized
strategies suffice for our purposes.

The rest of this section is devoted to the proof of Lemma 5.
We proceed by combining in a non-trivial way a strategy for
the expectation with a strategy for the worst-case. Let $f^{wc}$ be a
worst-case strategy s.t. $\vec{0} \in \mathrm{WCSol}_{\mathcal{G}}(s, f^{wc})$ for every state $s$,
which exists since $\mathcal{G}$ is pruned. Let $f^{exp}$ be an expectation
strategy s.t. $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G} \upharpoonright U}(s_0, f^{exp})$. By Lemma 1, we can
assume that $f^{exp}$ is finite-memory and unichain. For technical
reasons, it is convenient to assume that $f^{exp}$ is finite-memory,
even though we are going to construct an infinite-memory
strategy. Moreover, since we are in an EC, we can further
assume that $f^{exp}$ achieves expectation $> \vec{\nu}$ from every state
of the EC.

The idea is to play according to two different modes. In the
first mode, we play according to $f^{exp}$, and in the second mode
according to $f^{wc}$. We start in the first mode, and possibly go to
the second mode according to certain conditions. This happens
with a certain probability, which we call switching probability.
Once in the second mode, we remain in the second mode.
In order to achieve an expectation arbitrarily close to that
achieved by $f^{exp}$, we need to be able to make the switching
probability arbitrarily small. At the same time, in order to
ensure that the worst-case objective is satisfied, we need to
guarantee that, when no switch occurs, the mean payoff is
surely $> \vec{0}$. (If a switch occurs, the worst-case is satisfied by
the definition of $f^{wc}$.) These two constraints are conflicting
and make the construction of a combined strategy non-trivial.

The combined strategy $f_K$ is parameterized by a natural
number $K > 0$. In order to decide whether to switch to the
second mode or not, we keep track of the total payoff since
the beginning of the play as a $d$-dimensional vector. This value is
unbounded in general, which explains why the strategy
uses infinite memory. The first mode is split into phases,
each of length $K$. During phase $i \ge 0$, we play according to
$f^{exp}$ for $K$ steps. Let $N_i$ be

$$N_i = \frac{\vec{\nu} \cdot i \cdot K}{2}.$$

Thus, during the first mode the expected total payoff at the end
of phase $i$ is $> 2 \cdot N_{i+1}$. There are two conditions that can
trigger a switch to the second mode:

[Switching condition 1 (SC1)] If we are in phase $i \ge 1$ and
the total payoff since the beginning of the play is not
always $\ge N_i$ during the current phase, then switch to
$f^{wc}$ permanently.

[Switching condition 2 (SC2)] If the total payoff since the
beginning of the play is not $\ge 2 \cdot N_{i+1}$ at the end of the
current phase, then switch to $f^{wc}$ permanently.

It remains to show that we can choose $K > 0$
in order to satisfy the BWC objective. First, we show that, for
every choice of the parameter $K$, the combined strategy $f_K$
guarantees the worst-case objective.
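
The two switching conditions can be sketched as a monitor over the payoff vectors observed so far. This is only an illustration under stated assumptions: steps are numbered from 1, phase $i$ covers steps $K \cdot i + 1$ to $K \cdot (i+1)$, the bound $N_i$ is taken to be $\vec{\nu} \cdot i \cdot K / 2$, and the function name `first_switch` is hypothetical.

```python
from fractions import Fraction

def first_switch(payoffs, nu, K):
    """Return the step at which SC1 or SC2 first fires, or None.

    payoffs: list of d-dimensional weight vectors, one per step.
    nu:      expectation threshold vector.
    K:       phase length; phase i covers steps K*i+1 .. K*(i+1).
    """
    total = [Fraction(0)] * len(nu)
    for step, w in enumerate(payoffs, start=1):
        total = [t + x for t, x in zip(total, w)]
        i = (step - 1) // K  # index of the current phase
        # SC1: in phase i >= 1 the total payoff must stay >= N_i = nu*i*K/2.
        if i >= 1 and any(t < Fraction(v) * i * K / 2 for t, v in zip(total, nu)):
            return step
        # SC2: at the end of phase i the total payoff must be >= 2*N_{i+1} = nu*(i+1)*K.
        if step % K == 0 and any(t < Fraction(v) * (i + 1) * K for t, v in zip(total, nu)):
            return step
    return None

# A play that earns exactly nu at every step never triggers a switch.
print(first_switch([(1,)] * 4, nu=(1,), K=2))
```

The running total is the only unbounded quantity the monitor stores, which matches the remark that the combined strategy needs infinite memory.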

Proposition 5. For every $K \in \mathbb{N}$ and state $s_0$ in the EC $U$,
$\vec{0} \in \mathrm{WCSol}_{\mathcal{G}}(s_0, f_K)$.

Proof. There are two cases to consider. If we ever switch to
the second mode, then the run is eventually consistent with the
worst-case strategy $f^{wc}$, which guarantees worst-case mean
payoff $> \vec{0}$ (by prefix independence). Otherwise, assume that
we never leave the first mode. During phase $i \ge 1$ the total
payoff is always $\ge N_i = \frac{\vec{\nu} \cdot i \cdot K}{2}$ and the total length of the
play is at most $i \cdot K$. The average mean payoff during phase
$i$ is uniformly $\ge \frac{\vec{\nu}}{2}$. The limit inferior of the average mean
payoff is also $\ge \frac{\vec{\nu}}{2} > \vec{0}$. □

We conclude by showing that $K$ can be chosen s.t. the
combined strategy $f_K$ achieves expected mean payoff $> \vec{\nu}$.

Lemma 6. There exists $K \in \mathbb{N}$ s.t. $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, f_K)$.

Proof. We show that we can choose a $K > 0$ large enough s.t.
the switching probability is negligible, and thus the impact of
switching to the worst-case strategy on the expected mean
payoff is also negligible. For now fix an arbitrary $K > 0$, and
consider the Markov chain $\mathcal{G}[f_K]$. Let $p_K$ be the probability
to switch to the second mode due to SC1 in any phase $i \ge 1$,
and let $q_K$ be the probability to switch to the second mode
due to SC2 in any phase $i \ge 0$. Thus, with probability at most
$1 - (1 - p_K) \cdot (1 - q_K)$ we switch to the second mode. By prefix
independence of the mean payoff objective, the expected mean
payoff achieved by $f_K$ satisfies:

$$\mathbb{E}^{\mathcal{G}[f_K]}_{s_0}[\mathrm{MP}] \ge (1 - (1 - p_K) \cdot (1 - q_K)) \cdot \mathbb{E}^{\mathcal{G}[f^{wc}]}[\mathrm{MP}] + (1 - p_K) \cdot (1 - q_K) \cdot \mathbb{E}^{\mathcal{G}[f^{exp}]}_{s_0}[\mathrm{MP}]$$

Since $\mathbb{E}^{\mathcal{G}[f^{exp}]}_{s_0}[\mathrm{MP}] > \vec{\nu}$ by definition, it suffices to show that
both probabilities $p_K$ and $q_K$ can be made arbitrarily small.
We argue about them separately.

Let $p_{i,K}$ be the probability of switching to the second mode
due to SC1 during phase $i \ge 1$, i.e., the probability that the
total payoff goes below $N_i$ in any component:

$$p_{i,K} = \mathbb{P}^{\mathcal{G}[f_K]}_{s_0}\left[\exists (K \cdot i \le h < K \cdot (i+1)) \cdot \mathrm{TP}_h \not\ge N_i\right]$$

Then, $p_K = p_{1,K} + (1 - p_{1,K}) \cdot p_{2,K} + (1 - p_{1,K}) \cdot (1 - p_{2,K}) \cdot p_{3,K} + \cdots$,
and thus $p_K \le p_{1,K} + p_{2,K} + \cdots$. We claim the
following exponential upper bound on $p_{i,K}$.

Claim 1. There are rational constants $a$ and $b$ with $b < 1$ s.t.,
for every $i \ge 1$ and for sufficiently large $K$, $p_{i,K} \le a \cdot b^{K \cdot i}$.
Note that $a$ and $b$ depend neither on $K$ nor on $i$.

By the claim, $p_K \le a \cdot (b^K + b^{2K} + \cdots) = a \cdot b^K / (1 - b^K)$,
and thus $\lim_{K \to \infty} p_K = 0$ since $b^K < 1$.
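
The geometric bound obtained from Claim 1 can be checked numerically. The sketch below compares a truncated sum of the per-phase bounds $a \cdot b^{K \cdot i}$ with the closed form $a \cdot b^K / (1 - b^K)$; the constants `a = 3.0` and `b = 0.9` are hypothetical placeholders, since the actual constants depend on the Markov chain.

```python
def sc1_union_bound(a, b, K, terms=10_000):
    """Compare the truncated sum of a*b**(K*i), i >= 1, with its closed form."""
    partial = sum(a * b ** (K * i) for i in range(1, terms + 1))
    closed = a * b ** K / (1 - b ** K)
    return partial, closed

for K in (1, 5, 20, 50):
    partial, closed = sc1_union_bound(a=3.0, b=0.9, K=K)
    print(K, partial, closed)
```

The closed form shrinks as $K$ grows, which is the behaviour the proof relies on when making the switching probability negligible.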

Let $q_{i,K}$ be the probability of switching to the second mode
due to SC2 at the end of phase $i \ge 0$. Thus, $q_{i,K}$ is the
probability that, at the end of phase $i$, the total payoff is less
than $2 \cdot N_{i+1} = \vec{\nu} \cdot (i+1) \cdot K$ in any component:

$$q_{i,K} = \mathbb{P}^{\mathcal{G}[f_K]}_{s_0}\left[\mathrm{TP}_{K \cdot (i+1)} \not\ge 2 \cdot N_{i+1}\right]$$

We have $q_K = q_{0,K} + (1 - q_{0,K}) \cdot q_{1,K} + (1 - q_{0,K}) \cdot (1 - q_{1,K}) \cdot q_{2,K} + \cdots$.
We show $\lim_{K \to \infty} q_K = 0$ as in the last paragraph,
by the following claim.

Claim 2. There exist rational constants $a$ and $b$ with $b < 1$ s.t.,
for every $i \ge 0$ and sufficiently large $K$, $q_{i,K} \le a \cdot b^{K \cdot (i+1)}$.
Note that $a$ and $b$ depend neither on $K$ nor on $i$.

Both claims are proved in the appendix. □


2) The general case: As in the synthesis for finite-memory 
strategies (cf. Sec. III-A), we reduce the infinite-memory BWC 
problem to the solution of a system of linear inequalities. The 
new system of equations T' is shown in Fig. 5. It is obtained 
as a modification of system T from the finite-memory case 
shown in Fig. 4: Specifically, T' is the same as T, except 
that Equations (A2), (B), and (C3) are interpreted w.r.t. MEC 
(instead of MWEC). The correctness of the reduction is stated 
in the lemma below. 


Lemma 7. Let $\mathcal{G}$ be a pruned multidimensional mean-payoff
MDP, let $\vec{\nu} > \vec{0}$, and let $s_0 \in S$. There exists a (possibly
infinite-memory) strategy $h$ s.t. $(\vec{0};\vec{\nu}) \in \mathrm{BWCSol}_{\mathcal{G}}(s_0, h)$ if,
and only if, the system $T'$ has a non-negative solution.

Proof sketch. The proof is analogous to the proof of Lemma 4.
The crucial difference is that, by the modifications performed
to obtain $T'$ from $T$, we obtain strategies which almost surely
stay forever inside ECs, instead of WECs. Since we are
allowed infinite memory, we can approximate the BWC objective
inside ECs by replacing Lemma 2 with Lemma 5. □

Fig. 5: Linear system $T'$ for the BWC infinite-memory threshold problem.

Running example. An infinite-memory strategy can benefit 
both from the expectation (5,15) in U and from (15, 5) in 
V (which is not a WEC). By going to either EC with equal 
probability and playing according to a local BWC strategy, 
an infinite-memory strategy can ensure, for every e > 0, the 
BWC threshold ((0,0); (10 - e, 10 - e)). 
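
The arithmetic behind this example is a convex combination of local expectations, weighted by the probability of ending up in each EC. The sketch below reproduces it for the two vectors quoted above; the helper name `global_expectation` is hypothetical, and the computation ignores the $\epsilon$ loss incurred by the local BWC strategies.

```python
from fractions import Fraction

def global_expectation(local):
    """local: list of (probability, local expectation vector) pairs, one per EC."""
    assert sum(p for p, _ in local) == 1
    d = len(local[0][1])
    return tuple(sum(p * v[j] for p, v in local) for j in range(d))

half = Fraction(1, 2)
print(global_expectation([(half, (5, 15)), (half, (15, 5))]))
```

Shifting probability towards one EC trades one dimension against the other, which is why the equal split is the natural choice for a symmetric threshold.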

3) Complexity: We obtain the following complexity result
for the threshold problem for arbitrary controllers.

Theorem 6. The multidimensional mean-payoff BWC threshold
problem is coNP-complete.

Proof. Pruning the game to remove states which are losing for
the worst-case objective requires solving a multidimensional
mean-payoff game, which is coNP-complete by Theorem 2.
Then, by Lemma 7, it suffices to solve system $T'$. Notice
that system $T'$ is of polynomial size since there are only
polynomially many maximal ECs. □

Again, it is the worst-case problem that dominates the
complexity of the BWC problem. By restricting to essentially
worst-case unidimensional thresholds we obtain a better
complexity, as in Sec. III-A3.

Corollary 2. The multidimensional mean-payoff BWC threshold
problem w.r.t. essentially worst-case unidimensional thresholds
is in NP ∩ coNP.

This solves with optimal complexity the infinite-memory
unidimensional BWC problem, which was left open in [3].


IV. Beyond almost-sure synthesis 


We introduce a natural relaxation of the BWC problem
which enjoys a better complexity. Intuitively, we replace the
worst-case objective in the BWC problem with a weaker
almost-sure objective. While the BWC problem is coNP-complete,
we show that this relaxation can be solved in P, even
in the multidimensional setting. A similar result has recently
been obtained in [5]. Given an MDP $\mathcal{G}$, a starting state $s_0$
therein, and a Controller's strategy $f \in \Delta(\mathcal{G})$, the set of
almost-sure achievable solutions for $f$, denoted $\mathrm{ASSol}_{\mathcal{G}}(s_0, f)$, is
the set of vectors $\vec{\mu} \in \mathbb{R}^d$ s.t. Controller can almost surely
guarantee mean payoff $> \vec{\mu}$ when playing according to $f$, i.e.,

$$\mathrm{ASSol}_{\mathcal{G}}(s_0, f) = \left\{ \vec{\mu} \in \mathbb{R}^d \;\middle|\; \mathbb{P}^{\mathcal{G}[f]}_{s_0}[\mathrm{MP} > \vec{\mu}] = 1 \right\}.$$

The set of beyond almost-sure achievable solutions for $f$,
$\mathrm{BASSol}_{\mathcal{G}}(s_0, f)$, is the set of pairs of vectors $(\vec{\mu};\vec{\nu})$ s.t.
Controller can almost surely guarantee mean payoff $> \vec{\mu}$ and
achieve expected mean payoff $> \vec{\nu}$ when starting from $s_0$ and
playing according to $f$, i.e.,

$$\mathrm{BASSol}_{\mathcal{G}}(s_0, f) = \left\{ (\vec{\mu};\vec{\nu}) \in \mathbb{R}^{2d} \;\middle|\; \vec{\mu} \in \mathrm{ASSol}_{\mathcal{G}}(s_0, f) \text{ and } \vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, f) \right\}.$$

The set of beyond almost-sure achievable solutions is
$\mathrm{BASSol}_{\mathcal{G}}(s_0) = \bigcup_{f \in \Delta(\mathcal{G})} \mathrm{BASSol}_{\mathcal{G}}(s_0, f)$. Given $(\vec{\mu};\vec{\nu}) \in \mathbb{R}^{2d}$
and a state $s_0$, the beyond almost-sure threshold problem
asks whether $(\vec{\mu};\vec{\nu}) \in \mathrm{BASSol}_{\mathcal{G}}(s_0)$.

Remark 6. We assume w.l.o.g. that $\vec{\mu} = \vec{0}$ and $\vec{\nu} > \vec{0}$. The first
condition is ensured by subtracting $\vec{\mu}$ everywhere. The second
condition follows from the observation that, if the mean payoff
is $> \vec{\mu}$ almost surely, then the expectation is also $> \vec{\mu}$.


We observe that, inside an EC, there is no trade-off between
the almost-sure and the expectation objective.

Lemma 8. Let $\mathcal{G}$ be a multidimensional mean-payoff MDP, let
$s_0$ be a state in an EC $U$ thereof, and let $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G} \upharpoonright U}(s_0)$
be an expectation achievable while remaining inside $U$. There
exists a finite-memory strategy $g \in \Delta_F(\mathcal{G})$ s.t.
$(\vec{\nu};\vec{\nu}) \in \mathrm{BASSol}_{\mathcal{G} \upharpoonright U}(s_0, g)$ which also remains inside $U$.

Proof. By Lemma 1, there exists a finite-memory strategy $g$
s.t. $\vec{\nu} \in \mathrm{ExpSol}_{\mathcal{G} \upharpoonright U}(s_0, g)$ and $\mathcal{G}[g]$ is unichain. Consequently,
the mean payoff is $> \vec{\nu}$ almost surely. □

Thus, most of the effort goes into analyzing the general case.
As in the BWC problem, we reduce the BAS problem to the
solution of a system of linear inequalities. We assume that
from state $s_0$ all ECs are reachable with positive probability.
It turns out that the same system of equations $T'$ used in
the infinite-memory BWC threshold problem also solves the
BAS problem. We obtain a better complexity since we do
not require the MDP to be pruned (which avoids solving an
expensive mean-payoff game).

Theorem 7. The multidimensional mean-payoff BAS threshold 
problem is in P. 

Proof. The proof of correctness is the same as in Lemma 7,
where Lemma 8 replaces Lemma 5 in the analysis of ECs. 
Crucially for complexity, we do not need to assume that 
the MDP is pruned. Therefore, system T' can be built (and 
solved) in P. Since Lemma 8 even yields finite-memory 
strategies inside an EC, the construction of Lemma 7 shows 
that finite-memory strategies suffice for the BAS threshold 
problem. (This relies on the strict BAS semantics. If non-strict 
inequalities are used, then the problem can still be solved in P 
but the construction above yields an infinite-memory strategy, 
and infinite-memory strategies are more powerful than finite- 
memory ones for the non-strict BAS problem; cf. also [5].) □ 

Running example. The BAS problem is strictly weaker than
the BWC problem. Consider the MDP from Fig. 3 without the
edge (m, t). This modification makes both states u and v losing
for the worst-case, thus they are pruned away when solving the
BWC problem (even with infinite memory). On the other hand,
the mean payoff is almost surely (5,15) from V, and thus
it satisfies the almost-sure objective > (0,0). Therefore, for
every ε > 0, we can achieve the BAS threshold ((0,0); (10 − ε, 10 − ε))
by going to f or m with equal probability.

V. Conclusions 

In this paper, we studied the multidimensional generalization
of the beyond worst-case problem introduced by Bruyère
et al. [3]. We have provided tight coNP-completeness results
for this problem under both finite-memory and general
strategies. Since multidimensional mean-payoff games are already
coNP-complete, our upper bound shows that we can add
a multidimensional expectation optimization objective on top
of a worst-case requirement without a corresponding increase
in complexity. Notice that, while infinite-memory strategies
were known to be more powerful than finite-memory ones
already in the unidimensional setting [3], the corresponding
synthesis problem was left open. Our results thus complete
the complexity picture for this problem. Moreover, we showed
that, when the worst-case objective is unidimensional, the
complexity reduces to NP ∩ coNP, and this holds even for
multidimensional expectations. This generalizes with optimal
complexity the NP ∩ coNP upper bound for the unidimensional
beyond worst-case problem [3]. From a practical point of
view, our reductions to linear programming can be performed
in pseudo-polynomial time by using the results of [15] for
unidimensional mean-payoff games, and [16] for a fixed number
of dimensions. Furthermore, we introduced the beyond almost-sure
problem as a natural relaxation of the beyond worst-case
problem, by weakening the worst-case requirement to an
almost-sure one. This natural relaxation enjoys a polynomial-time
solution, and finite-memory strategies always suffice.
Moreover, our reduction to linear programming shows that
the beyond almost-sure problem is amenable to being solved
efficiently in practice, and thus it has the strongest appeal for
practical applications.

Appendix A
Preliminaries


A. Hoeffding-style bounds

In this section, we prove a Hoeffding-style bound for multidimensional Markov chains which will be used repeatedly in
later proofs. Let $\mathcal{G}$ be a Markov chain which is unichain. Recall that a Markov chain is unichain when it consists of transient
states and a unique bottom strongly connected component. Thus, any run in $\mathcal{G}$ will be trapped almost surely in the bottom
component, and the mean payoff will be almost surely equal to the expected mean payoff. Moreover, the expected mean payoff
is the same from every starting state of $\mathcal{G}$. Below, we present a bound on the probability that the mean payoff deviates from
the expected mean payoff for sufficiently long runs. Let $d \ge 1$ be the dimension of the weights in the Markov chain $\mathcal{G}$, and
let $\vec{\nu}$ be the expected mean-payoff vector from any state in $\mathcal{G}$.

Lemma 9. For any $\delta > 0$, there exist $K_0 = O(\frac{1}{\delta}) \in \mathbb{N}$ and constants $a, b > 0$ s.t., for all $K \ge K_0$ and state $s$,

$$\mathbb{P}^{\mathcal{G}}_s\left[\exists (1 \le j \le d) \cdot |(\mathrm{MP}_K - \vec{\nu})[j]| \ge \delta\right] \le \mathcal{G}_{a,b}(K, \delta) := 2^d \cdot a \cdot e^{-b \cdot K \cdot \delta^2}.$$

Moreover, $a$ and $b$ are polynomial in the parameters of the Markov chain, $a$ is exponential in $\delta$, and $K_0$ is polynomial in the
size of the Markov chain and in the largest absolute weight $W$ (and thus exponential in its encoding).

We prove the lemma above by reducing to the unidimensional case $d = 1$. The latter case was already dealt with in [4, Lemma
9], which in turn relies on [17, Proposition 2].

Lemma 10 (cf. [4, Lemma 9]). For any $\delta > 0$, there exist $K_0 = O(\frac{1}{\delta}) \in \mathbb{N}$ and constants $a, b > 0$ s.t., for all $K \ge K_0$ and
state $s$,

$$\mathbb{P}^{\mathcal{G}}_s\left[|\mathrm{MP}_K - \nu| \ge \delta\right] \le \mathcal{F}_{a,b}(K, \delta) := a \cdot e^{-b \cdot K \cdot \delta^2}.$$

Moreover, $a$ and $b$ are polynomial in the parameters of the Markov chain, $a$ is exponential in $\delta$, and $K_0$ is polynomial in the
size of the Markov chain and in the largest weight $W$ (and thus exponential in its encoding).

Proof of Lemma 9. In the following, fix an error $\delta > 0$ and a number of steps $K > 0$. For a component $1 \le j \le d$, we say
that a run $\pi$ is $j$-bad iff the $j$-th component of the mean payoff deviates from $\vec{\nu}[j]$ at least by $\delta$ after $K$ steps, i.e., if

$$|(\mathrm{MP}(\pi(K)) - \vec{\nu})[j]| \ge \delta$$

and $\pi$ is $j$-good otherwise. Moreover, we say that $\pi$ is bad if it is $j$-bad for some $1 \le j \le d$, and we say that $\pi$ is good if it is
$j$-good for every $1 \le j \le d$. In other words, $\pi$ is good if for every component $j$, $|(\mathrm{MP}(\pi(K)) - \vec{\nu})[j]| < \delta$.

For a fixed dimension $1 \le j \le d$, we are in the unidimensional case, dealt with in Lemma 10.
By Lemma 10, there exist constants $a_j, b_j > 0$ and $K_j = O(\frac{1}{\delta}) \in \mathbb{N}$ s.t., for every $K \ge K_j$ and state $s$, $\mathcal{F}_{a_j,b_j}(K, \delta)$ is an
upper bound on the probability that $\pi$ is $j$-bad. We want to choose uniform $a, b > 0$ and $K_0$ s.t.

$$\mathcal{F}_{a_j,b_j}(K, \delta) \le \mathcal{F}_{a,b}(K, \delta) < 1$$

for every $K \ge K_0$. To this end, let

$$a := \max_j a_j, \qquad b := \min_j b_j, \qquad K_0 := \max\left\{ \max_j K_j,\ \frac{\ln a}{b \cdot \delta^2} + 1 \right\}.$$

Note that $K_0 = O(\frac{1}{\delta})$. Then, $1 - \mathcal{F}_{a,b}(K, \delta)$ is a lower bound on the probability that $\pi$ is $j$-good, for any fixed $j$ and $K \ge K_0$.
Then, $(1 - \mathcal{F}_{a,b}(K, \delta))^d$ is a lower bound on the probability that $\pi$ is good. We derive the following simple lower bound on
the latter quantity:

$$(1 - \mathcal{F}_{a,b}(K, \delta))^d = \sum_{i=0}^{d} \binom{d}{i} (-\mathcal{F}_{a,b}(K, \delta))^i = 1 + \sum_{i \ge 1} \binom{d}{i} (-\mathcal{F}_{a,b}(K, \delta))^i \ge 1 - \sum_{i \ge 1} \binom{d}{i} \mathcal{F}_{a,b}(K, \delta)^i \ge 1 - 2^d \cdot \mathcal{F}_{a,b}(K, \delta)$$

Finally, $1 - (1 - \mathcal{F}_{a,b}(K, \delta))^d$ is an upper bound on the probability that $\pi$ is bad. We define

$$\mathcal{G}_{a,b}(K, \delta) := 2^d \cdot \mathcal{F}_{a,b}(K, \delta) = 2^d \cdot a \cdot e^{-b \cdot K \cdot \delta^2}.$$

By the inequality above, $\mathcal{G}_{a,b}(K, \delta) \ge 1 - (1 - \mathcal{F}_{a,b}(K, \delta))^d$, and thus $\mathcal{G}_{a,b}(K, \delta)$ is an upper bound on the probability that a
run is bad for every $K \ge K_0$. □
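
The elementary inequality used at the end of the proof can be checked directly. The sketch below takes a per-component bound `F` in $[0,1)$ and a dimension `d`, and returns both $1 - (1-F)^d$ and the looser bound $2^d \cdot F$; it also confirms that the binomial expansion agrees with the direct formula. The function name and the sample values are hypothetical.

```python
from math import comb

def bad_run_bounds(F, d):
    """Return (1 - (1-F)**d, 2**d * F) for a per-component bound F in [0, 1)."""
    exact = 1 - (1 - F) ** d
    # Same quantity via the binomial expansion of (1 - F)**d.
    expansion = -sum(comb(d, i) * (-F) ** i for i in range(1, d + 1))
    assert abs(exact - expansion) < 1e-12
    return exact, 2 ** d * F

for d in (1, 2, 5):
    print(d, bad_run_bounds(0.01, d))
```

The factor $2^d$ is generous, but it only affects the constant in front of the exponential and not the decay rate in $K$.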

B. Mean-payoff value function 

We make a couple of simple observations on the relationship between the mean-payoff value function and the total-payoff 
value function. 


Lemma 11. Let n be a play in a graph G. 

1) //’MP(7r) > 0, then TP(7r) = -foo. Therefore, ifTPin) < -foo (with possibly TP(7r) = —oo), then MP(7r) < 0. 

2) Similarly, if MP(7r) < 0, then TP(7r) = —oo. 

3) There exists a play ttq in a finite graph Gq sJ. MP(7ro) = 0, but TP(7ro) = -|-oo. 

4) There exists a play tti in a finite graph Gi s.t. MP(7ri) = 0, but TP(7ri) = —oo. 

Proof. We first prove Point 1). Assume $\mathrm{MP}(\pi) = a$ for some $a > 0$. By the definition of $\liminf$, for every $\varepsilon > 0$ there exists $m_0$ s.t. for every $m \geq m_0$ we have $\mathrm{MP}(\pi(m)) \geq a - \varepsilon$. To show $\mathrm{TP}(\pi) = +\infty$ we show that for every bound $b > 0$, there exists $n_0$ s.t. for every $n \geq n_0$, $\mathrm{TP}(\pi(n)) \geq b$. Let $b > 0$. If we take $\varepsilon := a/2$ in the definition above, we have that there exists $m_0$ s.t. for every $m \geq m_0$, $\mathrm{MP}(\pi(m)) \geq a/2 > 0$. We take $n_0 := \max\{m_0, 2b/a\}$, and let $n \geq n_0$. Since $\mathrm{TP}(\pi(n)) = n \cdot \mathrm{MP}(\pi(n))$, we have $\mathrm{TP}(\pi(n)) \geq n \cdot a/2 \geq b$, where the latter inequality follows from the definition of $n_0$.

The proof of Point 2) is analogous to the proof of Point 1).

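For Point 3), consider a play $\pi_0$ inducing the following sequence of payoffs:

$$\underbrace{1}_{2^0}\ \underbrace{1\,0}_{2^1}\ \underbrace{1\,0\,0\,0}_{2^2}\ \underbrace{1\,0\,0\,0\,0\,0\,0\,0}_{2^3} \cdots$$

i.e., the $n$-th payoff is 1 if $n$ is a power of 2, and 0 otherwise. Then, $\mathrm{TP}(\pi_0(n)) = k + 1$ where $k$ is the largest exponent s.t. $2^k \leq n$, i.e., $k = \lfloor \lg n \rfloor$. Thus, $\mathrm{TP}(\pi_0) = +\infty$. However, $\mathrm{MP}(\pi_0(n)) = \frac{k+1}{n}$ goes to 0 as $n$ goes to $+\infty$. Thus, $\mathrm{MP}(\pi_0) = 0$.
Point 4) is proved analogously by taking the sequence

$$\underbrace{(-1)}_{2^0}\ \underbrace{(-1)\,0}_{2^1}\ \underbrace{(-1)\,0\,0\,0}_{2^2}\ \underbrace{(-1)\,0\,0\,0\,0\,0\,0\,0}_{2^3} \cdots$$

□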
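The two counter-examples are easy to reproduce on finite prefixes. The sketch below is only an illustration; the helper names `payoff` and `prefix_payoffs` are hypothetical. It sums the payoffs over the first $n$ positions (counted from 1), so the prefix total equals the number of powers of two not exceeding $n$, while the prefix mean vanishes.

```python
def payoff(n, sign=1):
    """n-th payoff (n >= 1): `sign` if n is a power of two, 0 otherwise."""
    return sign if (n & (n - 1)) == 0 else 0

def prefix_payoffs(n, sign=1):
    """Return (total payoff, mean payoff) of the first n positions."""
    total = sum(payoff(i, sign) for i in range(1, n + 1))
    return total, total / n
```

With `sign=-1` the same code gives the sequence of Point 4): the prefix total decreases without bound while the prefix mean still tends to 0.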


As an application of Lemma 10, we show that if the total payoff is $-\infty$ almost surely, then the mean payoff is strictly negative almost surely. This contrasts with Point 4) in Lemma 11, which showed that there are infinite runs with total payoff equal to $-\infty$, but which have nonetheless zero mean payoff. (Notice that the infinite play constructed in the proof of the latter lemma with this property was non-periodic.) We use the lemma below later in the proof of Lemma 21.

Lemma 12. Let $\mathcal{G}$ be a Markov chain. For every state $s_0$,

$$\mathbb{P}^{\mathcal{G}}_{s_0}[\mathrm{TP} = -\infty] = \mathbb{P}^{\mathcal{G}}_{s_0}[\mathrm{MP} < 0].$$

In particular, if the mean payoff is non-negative almost surely, then the total payoff is $> -\infty$ almost surely.







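$$
\begin{array}{lll}
\sum_{s \in S} x_s = 1 & & \text{(EC-1)} \\
x_s = \sum_{(r,s) \in E} x_{rs} & \forall s \in S & \text{(EC-IN)} \\
x_s = \sum_{(s,t) \in E} x_{st} & \forall s \in S & \text{(EC-OUT)} \\
x_{st} = R(s)(t) \cdot x_s & \forall (s,t) \in E \text{ with } s \in S^R & \text{(EC-RAND)} \\
\sum_{(s,t) \in E} x_{st} \cdot w(s,t)[i] \geq \vec\nu[i] & \forall (1 \leq i \leq d) & \text{(EC-MP)}
\end{array}
$$

Fig. 6: System of linear inequalities for the expectation problem inside an EC.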


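Proof. Fix a state $s_0$. Let $p = \mathbb{P}^{\mathcal{G}}_{s_0}[\mathrm{MP} < 0]$, $q = \mathbb{P}^{\mathcal{G}}_{s_0}[\mathrm{TP} = -\infty]$, and, for a BSCC $B$, let $p_B$ be the probability of reaching $B$. In a BSCC $B$, MP and TP take value equal to their respective expectations almost surely, and this value is the same from every state in the BSCC. Let this value be $\mathbb{E}_B[\mathrm{MP}]$ and $\mathbb{E}_B[\mathrm{TP}]$, respectively. We thus have the following decomposition:

$$p = \sum_{\text{BSCC } B \text{ s.t. } \mathbb{E}_B[\mathrm{MP}] < 0} p_B, \qquad \text{and} \qquad q = \sum_{\text{BSCC } B \text{ s.t. } \mathbb{E}_B[\mathrm{TP}] = -\infty} p_B.$$

It suffices to show that $\mathbb{E}_B[\mathrm{MP}] < 0$ if, and only if, $\mathbb{E}_B[\mathrm{TP}] = -\infty$ for all BSCC's $B$.

If $\mathbb{E}_B[\mathrm{MP}] < 0$, then $\mathrm{MP} < 0$ almost surely since $B$ is a BSCC. By Point 2) of Lemma 11, $\mathrm{MP} < 0$ implies $\mathrm{TP} = -\infty$ surely, and thus $\mathrm{TP} = -\infty$ holds almost surely, and consequently $\mathbb{E}_B[\mathrm{TP}] = -\infty$.

For the other direction, let $\mathbb{E}_B[\mathrm{MP}] \geq 0$. If $\mathbb{E}_B[\mathrm{MP}] > 0$, then by Point 1) of Lemma 11 and reasoning as above, we obtain $\mathbb{E}_B[\mathrm{TP}] = +\infty$. It remains to prove the case $\mathbb{E}_B[\mathrm{MP}] = 0$. This makes use of the bound provided by Lemma 10, and it does not hold in a non-probabilistic setting (cf. the counter-example in Point 4) of Lemma 11). Assume $\mathbb{E}_B[\mathrm{MP}] = 0$. We prove $\mathbb{E}_B[\mathrm{TP}] > -\infty$. For every $s_1 \in B$ and $K$, we have

$$\mathbb{P}^{\mathcal{G}}_{s_1}[\mathrm{TP}_K \leq -K] \leq \mathbb{P}^{\mathcal{G}}_{s_1}[\mathrm{MP}_K \leq -1] \leq \mathbb{P}^{\mathcal{G}}_{s_1}[|\mathrm{MP}_K| \geq 1] \leq \mathcal{F}_{a,b}(K, 1)$$

where the last inequality follows from Lemma 10 applied with $\nu = \mathbb{E}_B[\mathrm{MP}] = 0$, $\delta = 1$, for some constants $a, b > 0$ and for $K$ sufficiently large. Since $\mathcal{F}_{a,b}(K, 1) \to 0$ as $K \to \infty$, we have that $\mathbb{P}^{\mathcal{G}}_{s_1}[\mathrm{TP} = -\infty] = 0$, which implies $\mathbb{E}_B[\mathrm{TP}] > -\infty$. □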


C. Finite-memory synthesis in an EC 

In this section, we show that achievable values can be approximated by randomized finite-memory strategies in ECs, with 
the further property that the induced (finite) Markov chain is unichain, i.e., it contains exactly one BSCC. 

Lemma 1. Let $\mathcal{G}$ be a multidimensional mean-payoff MDP, let $s_0$ be a state in an EC $U$ thereof and let $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}|U}(s_0)$ be an expectation vector achievable by remaining inside $U$. There exists a finite-memory unichain strategy $g \in \Delta_F(\mathcal{G})$ achieving the same expectation $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}|U}(s_0, g)$.

We prove this result as follows. First, in Sec. A-C1 we characterize the set of achievable vectors as non-negative solutions
to a linear programming problem (in the spirit of [9]). This yields a natural decomposition of the EC into several SCCs. 
For each such SCC, we construct in Sec. A-C2 a randomized memoryless “local strategy” achieving a corresponding “local 
expectation”. No approximation error is introduced in this step. Then, in Sec. A-C3 we combine those “local strategies” into 
a randomized finite-memory “global strategy” approximating the expectation. This second step uses the fact that in an EC all 
states are inter-reachable (under some strategy), and thus we can cycle through all the “local strategies” for the appropriate 
fraction of time. This step introduces an approximation error, due to the cost of moving from a SCC to the next one. However, 
by using larger amounts of finite memory, we can make this error arbitrarily small. 




1) Decomposition in SCCs: In the following, let $\mathcal{G} = (G, S^C, S^R, R)$ with $G = (d, S, E, w)$ be a fixed MDP. W.l.o.g. we assume that $\mathcal{G}$ is reduced to a single EC $S$. Let $\vec\nu \in \mathbb{Q}^d$ be an expected-value achievable vector. Consider the linear program $A_{\vec\nu}$ of Fig. 6. (Cf. [9] for a similar linear program in the more general case where the MDP is not just an EC.) We use the linear program $A_{\vec\nu}$ to obtain the long-run "frequencies" of edges guaranteeing mean payoff $\vec\nu$. For each state $s \in S$, we have a variable $x_s$ representing the long-run probability to be in $s$, and, for each edge $(s,t) \in E$, we have a variable $x_{st}$ for the long-run probability of taking edge $(s,t)$. In the following, let

$$\mathrm{ExpSol}_{\mathcal{G}}(s_0, f) = \{\vec\nu \in \mathbb{R}^d \mid \mathbb{E}^{\mathcal{G}[f]}_{s_0}[\mathrm{MP}] > \vec\nu\},$$

$$\mathrm{ExpSol}_{\mathcal{G}}(s_0) = \bigcup_{f \in \Delta(\mathcal{G})} \mathrm{ExpSol}_{\mathcal{G}}(s_0, f).$$

The lemma below shows that $A_{\vec\nu}$ has a non-negative solution if $\vec\nu$ is achievable in $\mathcal{G}$. Correctness follows directly from the analysis of [9]. The complexity of the solution follows from [18, Theorem 10.1]. In the statement below, recall that $W$ is the maximum absolute value of any weight in $\mathcal{G}$, and $Q$ is the largest denominator of any probability appearing therein.

Lemma 13. Let $\mathcal{G}$ be a multidimensional mean payoff MDP reduced to a single EC $S$, let $s_0 \in S$ be a state therein, and let $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$ be an achievable value for the expectation. Then, the system $A_{\vec\nu}$ has a non-negative solution of size^ bounded by a polynomial in $W$ and $Q$.

Let $\{x_{st}\}_{(s,t) \in E}$, $\{x_s\}_{s \in S}$ be a non-negative solution to $A_{\vec\nu}$ of complexity polynomial in $W$ and $Q$. This allows us to perform the following decomposition of $\mathcal{G}$ into strongly connected components. Let $S_{>0}$ be the set of states visited with (strictly) positive long-run average probability, and let $E_{>0}$ be the set of edges visited with (strictly) positive long-run average probability:

$$S_{>0} = \{s \in S \mid x_s > 0\}$$

$$E_{>0} = \{(s,t) \in E \mid x_{st} > 0\}$$

By the flow conditions of $A_{\vec\nu}$, $E_{>0} = E \cap (S_{>0} \times S_{>0})$, i.e., positive edges are exactly those connecting positive states. States in $S_{>0}$ can be partitioned into maximal strongly connected components $\{S_1, \ldots, S_k\}$ (w.r.t. $E_{>0}$), such that there is no positive edge between different components. Since states in $S_i$ have at least one successor (as $S_i \subseteq S_{>0}$) and all successors are in fact inside $S_i$, we have that $S_i$ is an EC in $\mathcal{G}$. Let $E_i = E_{>0} \cap S_i \times S_i$ be the restriction of $E_{>0}$ to $S_i$, let $G_i = (d, S_i, E_i, w)$ be the corresponding graph, and let $\mathcal{G}_i = (G_i, S_i^C, S_i^R, R)$ be the resulting MDP, where $S_i^C = S^C \cap S_i$ and $S_i^R = S^R \cap S_i$.

For each $i \in \{1, \ldots, k\}$, let $x_i > 0$ be the total long-run average probability of being in $S_i$, and let $\vec\nu_i$ be the expected mean payoff vector achieved when starting from (anywhere) in component $S_i$:

$$x_i = \sum_{s \in S_i} x_s > 0 \qquad (3)$$

$$\vec\nu_i = \frac{1}{x_i} \cdot \sum_{(s,t) \in E_i} x_{st} \cdot w(s,t) \qquad (4)$$

Since component $S_i$ cannot be left, it is reached with probability $x_i$, and thus the expected mean payoff is $\sum_{i=1}^{k} x_i \cdot \vec\nu_i$.

Remark 7. This analysis immediately yields a 2-memory randomized strategy achieving expected mean payoff $\vec\nu$. Such a strategy goes to SCC $S_i$ with probability $x_i$, and then plays edge $(s,t)$ with probability $x_{st}/x_s$. Two memory states are required to discriminate the two phases. However, such a strategy is not unichain in general, which is what we aim at in this section.

We design a randomized finite-memory unichain strategy that plays edges $(s,t) \in E_{>0}$ with approximate long-run average frequency $x_{st}$, in order to have mean payoff close to $\vec\nu$. We do this in two steps. First, in Sec. A-C2 we design, for each SCC $S_i$, a randomized memoryless "local strategy" $g_i$ which plays edge $(s,t) \in E_i$ with long-run average frequency $x_{st}/x_i$ when started inside $S_i$. Then, in Sec. A-C3 we combine those $g_i$'s into a global strategy that spends in each $S_i$ an approximate long-run fraction of time $x_i$. By using larger amounts of memory, the error in this approximation can be made arbitrarily small.

2) Inside a SCC $S_i$ (Strategy $g_i$): For each SCC $S_i$, let $g_i$ be the randomized memoryless strategy that plays edge $(s,t) \in E_i$ with $s \in S^C$ with probability $x_{st}/x_s$. Thus, in $\mathcal{G}_i[g_i]$ edge $(s,t) \in E_i$ is visited for a long-run proportion of time $x_{st}/x_i$.

Lemma 14. $\mathcal{G}_i[g_i]$ is recurrent, and for every state $s_0 \in S_i$,

$$\mathbb{E}^{\mathcal{G}_i[g_i]}_{s_0}[\mathrm{MP}] = \vec\nu_i.$$

^The size of a non-negative rational number $x = p/q$ with $p, q \in \mathbb{N}$ relatively prime, and $q > 0$, is the size of their bit representation, $1 + \lceil \log_2(p + 1) \rceil + \lceil \log_2(q + 1) \rceil$; [18, Section 3.2].

[Figure: the components $\mathcal{G}_1$, $\mathcal{G}_2$, $\mathcal{G}_3$ arranged in a cycle. Inside $\mathcal{G}_i$: play $g_i$ for $A \cdot c_i$ steps. Between components: play $h_2$ from $\mathcal{G}_1$ to $\mathcal{G}_2$, play $h_3$ from $\mathcal{G}_2$ to $\mathcal{G}_3$, and play $h_1$ from $\mathcal{G}_3$ back to $\mathcal{G}_1$.]

Fig. 7: The global strategy
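The decomposition and the local strategies can be read off directly from a solution of the linear program. The following is a minimal sketch under stated assumptions: the solution is given as a dictionary of exact edge frequencies, the function name `decompose` and its data layout are hypothetical, and components are computed with a union-find over positive edges. That shortcut relies on the flow conditions (EC-IN) and (EC-OUT): in a balanced flow every positive edge lies on a positive cycle, so weakly connected components of the positive-edge graph are strongly connected.

```python
from fractions import Fraction

def decompose(x_edge, weights, dim):
    """x_edge: {(s, t): frequency x_st}; weights: {(s, t): payoff tuple}.
    Returns one (states, x_i, nu_i, local_strategy) tuple per component."""
    pos = {e: x for e, x in x_edge.items() if x > 0}   # positive edges E_{>0}
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]              # path halving
            u = parent[u]
        return u

    x_state = {}
    for (s, t), x in pos.items():
        x_state[s] = x_state.get(s, 0) + x             # x_s via (EC-OUT)
        parent[find(s)] = find(t)                      # merge endpoints

    comps = {}
    for s in x_state:
        comps.setdefault(find(s), []).append(s)

    result = []
    for states in comps.values():
        members = set(states)
        inside = {e: x for e, x in pos.items() if e[0] in members}
        x_i = sum(x_state[s] for s in states)          # Eq. (3)
        nu_i = tuple(sum(x * weights[e][j] for e, x in inside.items()) / x_i
                     for j in range(dim))              # Eq. (4)
        local = {e: x / x_state[e[0]] for e, x in inside.items()}  # x_st / x_s
        result.append((members, x_i, nu_i, local))
    return result
```

Summing `x_i * nu_i` over the returned components recovers the overall expected mean payoff encoded by the frequencies.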


In the following lemma, we show that the mean payoff obtained by playing according to $g_i$ for a sufficiently long time is close to $\vec\nu_i$ with high probability. The constant $L_0$ in the statement of the lemma does not depend on $S_i$, thus the guarantee holds in every component $S_i$. The lemma follows from a Hoeffding-style analysis.

Lemma 15. For every $\delta > 0$, there exists $L_0$ = O(^) $\in \mathbb{N}$ and constants $a, b > 0$ s.t., for every component $S_i$, for every $L \geq L_0$ and for every state $s_0 \in S_i$,

$$\mathbb{P}^{\mathcal{G}_i[g_i]}_{s_0}\big[\exists (1 \leq j \leq d) \cdot |(\mathrm{MP}_L - \vec\nu_i)[j]| \geq \delta\big] \leq \mathfrak{G}(L, \delta) := 2^d \cdot a \cdot e^{-b \cdot L \cdot \delta^2}.$$

Proof. The lemma follows from an application of Lemma 9 to each component $S_i$ separately, and then by aggregating the constants. More precisely, for each irreducible (and thus unichain) Markov chain $\mathcal{G}_i[g_i]$, $1 \leq i \leq k$, Lemma 9 provides $L_i$ (called $K_0$ in the lemma) and constants $a_i, b_i > 0$ s.t. for every $L \geq L_i$ and state $s_0 \in S_i$,

$$\mathbb{P}^{\mathcal{G}_i[g_i]}_{s_0}\big[\exists (1 \leq j \leq d) \cdot |(\mathrm{MP}_L - \vec\nu_i)[j]| \geq \delta\big] \leq \mathfrak{G}_i(L, \delta) := 2^d \cdot a_i \cdot e^{-b_i \cdot L \cdot \delta^2}.$$

Just take $L_0 := \max\{L_1, \ldots, L_k\}$, $a := \max\{a_1, \ldots, a_k\}$, and $b := \min\{b_1, \ldots, b_k\}$ to satisfy the claim. □

3) Across SCCs (The global strategy): We now combine the local strategies $g_i$'s in order to achieve approximate expected mean payoff $\vec\nu$ with finite memory. For each SCC $S_i$, let $h_i$ be a memoryless strategy ensuring that $S_i$ is reached almost surely from any state in $\mathcal{G}$. The strategy $g_A$ is parametrized by a natural number $A > 0$. Assume that $x_i$ is of the form $x_i = a_i/b_i$, with $a_i, b_i \in \mathbb{N}$ relatively prime, $b_i > 0$, let $b = \mathrm{lcm}\{b_1, \ldots, b_k\}$, and let $c_i = b \cdot x_i$. Note that $c_i$ is a natural number. Intuitively, $g_A$ works in $k$ different stages. In stage $i \in \{1, \ldots, k\}$, $g_A$ does the following:

(a) Play $h_i$ to reach $S_i$ almost surely.

(b) Once in $S_i$, switch to strategy $g_i$ for $A \cdot c_i$ steps. Then, switch to stage $(i \bmod k) + 1$ and go to (a).

A full repetition of stages $\{1, \ldots, k\}$ is called a phase. Intuitively, $g_A$ spends a proportion of time $x_i$ in $S_i$ in the limit, and, while the game stays in $S_i$, $g_A$ plays according to $g_i$. Recall that $g_i$ is memoryless.

Remark 8. Strategy $g_A$ can be implemented with memory bounded by $k \cdot A \cdot b$. Notice that in both (a) and (b), $g_A$ plays according to a memoryless strategy, and no memory is needed to distinguish (a) from (b) since it suffices to look at the current state. Since the size of the binary representation of $b$ is polynomial in $W$ and $Q$ (cf. Lemma 13), strategy $g_A$ uses memory exponential in $W$ and $Q$, and linear in $n$ and $A$, where $n$ is the number of states of $\mathcal{G}$.
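The stage lengths of the global strategy follow mechanically from the component probabilities. The sketch below is an illustration only; the names `phase_schedule` and `next_stage` are hypothetical, and `math.lcm` with several arguments assumes Python 3.9 or later.

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm requires Python 3.9+

def phase_schedule(x, A):
    """x: list of Fractions x_i summing to 1; A: the memory parameter.
    Returns (b, [c_i], [A * c_i]), i.e. the steps spent in each stage (b)."""
    b = lcm(*(x_i.denominator for x_i in x))  # common denominator of the x_i
    c = [int(b * x_i) for x_i in x]           # c_i = b * x_i is a natural number
    return b, c, [A * c_i for c_i in c]

def next_stage(i, k):
    """Stage following stage i in {1, ..., k}: (i mod k) + 1."""
    return (i % k) + 1
```

For instance, with probabilities 1/2, 1/3 and 1/6 the stages last `3*A`, `2*A` and `A` steps, so the periods of type (b) add up to `A * b` steps per phase.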

We show that for any additive error $\varepsilon > 0$, we can play each stage sufficiently long (by increasing the parameter $A$) s.t. the probability of deviating from the expected mean payoff $\vec\nu$ by more than $\varepsilon$ in any component is small.

Lemma 16. For any achievable vector $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$ and $\varepsilon > 0$, there exists an $A_\varepsilon \in \mathbb{N}$ (= O(^)) s.t., for every $A \geq A_\varepsilon$, $(\vec\nu - \varepsilon) \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, g_A)$.

Proof. We begin by analysing the expected mean payoff of $g_A$ over a single phase. Let $e$ be the expected mean payoff of a single phase. We prove that for every $\varepsilon > 0$ there exists $A$ large enough s.t. strategy $g_A$ achieves at least expected mean payoff $\vec\nu - \varepsilon$ over a single phase.

Let $l$ be an upper bound on the expected length of periods of type (a). That a finite such $l$ exists can be seen as follows. Formally, let $H_{s,i}$ be the random variable that returns the first hitting time of the set $S_i$ when starting from $s \in S$, that is the number of steps to reach $S_i$. Let $l_i = \max_{s \in S} \mathbb{E}^{\mathcal{G}[h_i]}_{s}[H_{s,i}]$ be the worst expected first hitting time of $S_i$ from any state in $S$ when playing according to the memoryless strategy $h_i$ (which reaches $S_i$ almost surely), and take $l = \sum_{i=1}^{k} l_i$. By the definition of $h_i$ and standard results about hitting times, the value $l$ is finite.

The length of a period of type (b) at stage $i$ is $A \cdot c_i$, thus the total length of periods of type (b) over a single phase is $\sum_{i=1}^{k} A \cdot c_i = A \cdot b$. Since $l$ (computed above) is the expected length of periods of type (a) over a single phase, the expected length of a single phase is at most $A \cdot b + l$.

Let $e_{(b),i}$ be the expected mean payoff of a period of type (b) at stage $i$. Apply Lemma 15 with $\delta := \varepsilon/8 > 0$, and let $L_0$ be as given by the lemma. Thus, for every component $S_i$, if we play $g_i$ for time $L \geq L_0$, we have the following lower bound on $e_{(b),i}$:

$$e_{(b),i} \geq (1 - \mathfrak{G}(L, \varepsilon/8)) \cdot (\vec\nu_i - \varepsilon/8) + \mathfrak{G}(L, \varepsilon/8) \cdot (-\vec{W})$$

where $W$ is the largest absolute value of weights in $\mathcal{G}$ and $\vec{W} = (W, \ldots, W) \in \mathbb{R}^d$. Moreover, since $\mathfrak{G}(L, \varepsilon/8) \to 0$ as $L \to \infty$, there exists an $L_* \geq L_0$ s.t., for every $L \geq L_*$, $(1 - \mathfrak{G}(L, \varepsilon/8)) \cdot (\vec\nu_i - \varepsilon/8) + \mathfrak{G}(L, \varepsilon/8) \cdot (-\vec{W}) \geq \vec\nu_i - \varepsilon/4$, and thus

$$e_{(b),i} \geq \vec\nu_i - \varepsilon/4 \qquad (5)$$

for every $L \geq L_*$. We derive a precise bound for $L_*$.


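$$
\begin{aligned}
& (1 - \mathfrak{G}(L, \varepsilon/8)) \cdot (\vec\nu_i - \varepsilon/8) + \mathfrak{G}(L, \varepsilon/8) \cdot (-\vec{W}) \geq \vec\nu_i - \varepsilon/4 \\
\text{iff}\quad & \vec\nu_i - \varepsilon/8 - \mathfrak{G}(L, \varepsilon/8) \cdot (\vec\nu_i - \varepsilon/8 + \vec{W}) \geq \vec\nu_i - \varepsilon/4 \\
\text{iff}\quad & \varepsilon/8 - \mathfrak{G}(L, \varepsilon/8) \cdot (\vec\nu_i - \varepsilon/8 + \vec{W}) \geq 0 \\
\text{if}\quad & \varepsilon/8 - \mathfrak{G}(L, \varepsilon/8) \cdot (\nu_i^{\max} - \varepsilon/8 + W) \geq 0 \\
\text{iff}\quad & \mathfrak{G}(L, \varepsilon/8) \leq \frac{\varepsilon}{8 \nu_i^{\max} - \varepsilon + 8 W} \\
\text{iff}\quad & e^{-b \cdot L \cdot \varepsilon^2/64} \leq \frac{1}{a \cdot 2^d} \cdot \frac{\varepsilon}{8 \nu_i^{\max} - \varepsilon + 8 W} \\
\text{iff}\quad & -b \cdot L \cdot \varepsilon^2/64 \leq \ln\Big(\frac{1}{a \cdot 2^d} \cdot \frac{\varepsilon}{8 \nu_i^{\max} - \varepsilon + 8 W}\Big) \\
\text{iff}\quad & L \geq \frac{64}{b \cdot \varepsilon^2} \cdot \ln \frac{a \cdot 2^d \cdot (8 \nu_i^{\max} - \varepsilon + 8 W)}{\varepsilon} \\
\text{iff}\quad & L \geq \frac{64}{b \cdot \varepsilon^2} \cdot \big(\ln a + d \cdot \ln 2 + \ln(8 \nu_i^{\max} - \varepsilon + 8 W) - \ln \varepsilon\big)
\end{aligned}
$$

where $\nu_i^{\max} = \max\{\vec\nu_i[1], \ldots, \vec\nu_i[d]\}$ is the largest component of $\vec\nu_i$ and $\mathfrak{G}(L, \varepsilon/8) = a \cdot 2^d \cdot e^{-b \cdot L \cdot \varepsilon^2/64}$. Thus, take $L_* := \frac{64}{b \cdot \varepsilon^2} \cdot (\ln a + d \cdot \ln 2 + \ln(8 \nu_i^{\max} - \varepsilon + 8 W) - \ln \varepsilon)$. Notice that L* = *^(e) (since a is exponential in e).

Therefore, we stay in stage $i$ at least $L_*$ number of steps, which implies that we should have $A \cdot c_i \geq L_*$ for every $1 \leq i \leq k$.

Assumption 1. $A \geq A_0 := \max\{L_*, L_0\}$ = O(^).

By the definition of $L_0$, $A_0$ is exponential in $n$ (the number of states of $\mathcal{G}$), and polynomial in $W$ and $Q$.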
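The closed form for the stage length can be sanity-checked numerically. The following is a minimal sketch, assuming the Hoeffding-style bound has the shape $2^d \cdot a \cdot e^{-b L \delta^2}$; the function names and the sample constants are hypothetical.

```python
import math

def g_bound(a, b, d, L, delta):
    """Assumed deviation bound 2^d * a * exp(-b * L * delta^2)."""
    return 2 ** d * a * math.exp(-b * L * delta ** 2)

def l_star(a, b, d, eps, nu_max, W):
    """Stage length after which the bound at delta = eps/8 is small enough."""
    return 64 / (b * eps ** 2) * (math.log(a) + d * math.log(2)
                                  + math.log(8 * nu_max - eps + 8 * W)
                                  - math.log(eps))
```

Rounding `l_star` up to an integer `L` and evaluating `g_bound` at `delta = eps / 8` gives a value below `eps / (8 * nu_max - eps + 8 * W)`, which is the condition used to obtain Eq. 5.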

For a period of type (b) in stage $i$, the expected total payoff is $A \cdot c_i \cdot e_{(b),i}$. The expected total payoff of all periods of type (a) during a single phase is at least $-W \cdot l$. Thus,

$$e \geq \frac{A \cdot c_1 \cdot e_{(b),1} + \cdots + A \cdot c_k \cdot e_{(b),k} - W \cdot l}{A \cdot b + l} = \frac{x_1 \cdot e_{(b),1} + \cdots + x_k \cdot e_{(b),k}}{1 + \frac{l}{A \cdot b}} - \frac{W \cdot l}{A \cdot b + l}.$$

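First, we choose $A$ large enough s.t. (A) $\frac{W \cdot l}{A \cdot b + l} \leq \frac{\varepsilon}{2}$:

$$\frac{W \cdot l}{A \cdot b + l} \leq \frac{\varepsilon}{2} \quad\text{iff}\quad 2 \cdot W \cdot l \leq \varepsilon \cdot (A \cdot b + l) = (\varepsilon \cdot b) \cdot A + \varepsilon \cdot l \quad\text{iff}\quad A \geq \frac{l \cdot (2 \cdot W - \varepsilon)}{\varepsilon \cdot b}.$$

Assumption 2. $A \geq A_1 := \frac{l \cdot (2 \cdot W - \varepsilon)}{\varepsilon \cdot b}$ = O(^).

When $A \geq A_1$ we also have (B) $e_{(b),i} / (1 + \frac{l}{A \cdot b}) \geq \vec\nu_i - \frac{\varepsilon}{2}$, for every $1 \leq i \leq k$. Indeed,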

- 1 ) Z 0 + 3 ^) - 
'’i' (■’•-f) 

Vj i.A-b>i-h,\j\-^) 

^>r(2g.b-]-£) 

“ e • & 


£ 

” 2 

(A-6 + ?) 


where the first step follows from the inequality in Eq. 5, and the last step follows from since, for an 

achievable vector i/i, it must clearly be the case that i/i[j] < W, the largest absolute value of any weight in the game. 

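Thanks to the two assumptions above, we can derive the following bound on $e$:

$$
\begin{aligned}
e &\geq \frac{x_1 \cdot e_{(b),1} + \cdots + x_k \cdot e_{(b),k}}{1 + \frac{l}{A \cdot b}} - \frac{W \cdot l}{A \cdot b + l} \\
&\geq \frac{x_1 \cdot e_{(b),1} + \cdots + x_k \cdot e_{(b),k}}{1 + \frac{l}{A \cdot b}} - \frac{\varepsilon}{2} \\
&\geq x_1 \cdot \Big(\vec\nu_1 - \frac{\varepsilon}{2}\Big) + \cdots + x_k \cdot \Big(\vec\nu_k - \frac{\varepsilon}{2}\Big) - \frac{\varepsilon}{2} \\
&= x_1 \cdot \vec\nu_1 + \cdots + x_k \cdot \vec\nu_k - \varepsilon \\
&= \vec\nu - \varepsilon
\end{aligned}
$$

where the second inequality follows from (A), and the third inequality follows from (B). We also use the property that $\sum_{i=1}^{k} x_i = 1$ and $x_1 \cdot \vec\nu_1 + \cdots + x_k \cdot \vec\nu_k = \vec\nu$.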

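Now that we have a lower bound on the expected mean payoff over a single phase, we can show that the expected mean payoff of $g_A$ over longer and longer prefixes of an infinite run (thus spanning many phases) converges to $e$, from which the lemma follows. Notice that the expected mean payoff over $m$ phases is simply $m \cdot e$. Let $e_n$ be the expected mean payoff of $g_A$ in the first $n$ steps. Since in $n$ steps there are an expected number of $\lfloor \frac{n}{A \cdot b + l} \rfloor$ phases of expected length $A \cdot b + l$ each, we obtain

$$
\begin{aligned}
e_n &\geq \frac{1}{n} \cdot \Big(\Big\lfloor \frac{n}{A \cdot b + l} \Big\rfloor \cdot (A \cdot b + l) \cdot e + \Big(n - \Big\lfloor \frac{n}{A \cdot b + l} \Big\rfloor \cdot (A \cdot b + l)\Big) \cdot (-W)\Big) \\
&\geq \frac{\lfloor \frac{n}{A \cdot b + l} \rfloor \cdot (A \cdot b + l) \cdot e - (A \cdot b + l) \cdot W}{n} \\
&\geq \frac{(n - (A \cdot b + l)) \cdot e - (A \cdot b + l) \cdot W}{n}
\end{aligned}
$$

and thus $\liminf_n e_n = e$. To conclude, take $A_\varepsilon := \max\{A_0, A_1\}$ = O(^), and the claim is satisfied for any $A \geq A_\varepsilon$. □

Lemma 17. For every $A \geq n$, $\mathcal{G}[g_A]$ is unichain.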


Proof. Each $\mathcal{G}_i[g_i]$ is recurrent by Lemma 14. By the definition of $g_A$, $\mathcal{G}[g_A]$ is obtained by staying in $\mathcal{G}_i[g_i]$ for a certain number of steps and then going to $\mathcal{G}_j[g_j]$ with $j = (i \bmod k) + 1$ almost surely. If we stay in $\mathcal{G}_i[g_i]$ for at least $n$ steps, then with positive probability we can visit every state therein. Thus, $\mathcal{G}[g_A]$ is unichain. □


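In the following, for every $\varepsilon > 0$, we denote by $g_\varepsilon$ the strategy $g_{\max\{n, A_\varepsilon\}}$, where $A_\varepsilon$ is the bound provided by Lemma 16.

Proof of Lemma 1. W.l.o.g. we assume that $\mathcal{G}$ reduces to a single end-component $S$. Let $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$. There exists $\vec\nu^* \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$ s.t. $\vec\nu < \vec\nu^*$. Let $\varepsilon = \frac{1}{2} \cdot \min_{1 \leq i \leq d}(\vec\nu^*[i] - \vec\nu[i]) > 0$ be half of the minimal difference between $\vec\nu^*$ and $\vec\nu$ in any component. Take $g := g_\varepsilon$. Then, $(\vec\nu^* - \varepsilon) \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, g)$ by Lemma 16. By the choice of $\varepsilon$, $\vec\nu < \vec\nu^* - \varepsilon$, and thus $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0, g)$. Finally, $\mathcal{G}[g]$ is unichain by Lemma 17. □

The next lemma will be used later to derive a bound on the number of steps $K$ that strategy $g_{\varepsilon/2}$ should be played for in order to have a mean payoff close to $\vec\nu$ with high probability.

Lemma 18. For any $\varepsilon > 0$, there exist $K_0 \in \mathbb{N}$ and constants $a, b > 0$ s.t., for any $K \geq K_0$ and state $s_0$,

$$\mathbb{P}^{\mathcal{G}[g_{\varepsilon/2}]}_{s_0}\big[\exists (1 \leq j \leq d) \cdot |(\mathrm{MP}_K - \vec\nu)[j]| \geq \varepsilon\big] \leq \mathfrak{G}(K, \varepsilon/2) := 2^d \cdot a \cdot e^{-b \cdot K \cdot \varepsilon^2/4}.$$

Proof. Fix an error $\varepsilon > 0$, and let $\vec\nu^* = \mathbb{E}^{\mathcal{G}[g_{\varepsilon/2}]}_{s_0}[\mathrm{MP}]$ be the expected mean payoff when playing according to $g_{\varepsilon/2}$. By the choice of $g_{\varepsilon/2}$, $\vec\nu - \varepsilon/2 < \vec\nu^* \leq \vec\nu$. Thus, $|(\mathrm{MP}_K - \vec\nu)[j]| \geq \varepsilon$ implies $|(\mathrm{MP}_K - \vec\nu^*)[j]| \geq \varepsilon/2$, and we have

$$\mathbb{P}^{\mathcal{G}[g_{\varepsilon/2}]}_{s_0}\big[\exists (1 \leq j \leq d) \cdot |(\mathrm{MP}_K - \vec\nu)[j]| \geq \varepsilon\big] \leq \mathbb{P}^{\mathcal{G}[g_{\varepsilon/2}]}_{s_0}\big[\exists (1 \leq j \leq d) \cdot |(\mathrm{MP}_K - \vec\nu^*)[j]| \geq \varepsilon/2\big]. \qquad (6)$$

Since $\mathcal{G}[g_{\varepsilon/2}]$ is unichain by Lemma 17, we can apply Lemma 9, and, thus, there exist $K_0$ and constants $a, b > 0$ s.t., for every state $s$ and $K \geq K_0$,

$$\mathbb{P}^{\mathcal{G}[g_{\varepsilon/2}]}_{s}\big[\exists (1 \leq j \leq d) \cdot |\mathrm{MP}_K[j] - \vec\nu^*[j]| \geq \varepsilon/2\big] \leq \mathfrak{G}(K, \varepsilon/2),$$

from which the claim follows by Eq. 6. □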

Appendix B 

Beyond worst-case synthesis 

A. Finite-memory synthesis 

In this section, we give full proofs for some statements from Sec. III-A. Fix a game $\mathcal{G}$ and worst-case threshold $\vec\mu = \vec 0$.

Proposition 2. Let f be a finite-memory strategy satisfying the worst-case threshold problem. The set of states visited infinitely 
often under f is almost surely a winning EC. 

Proof. By Proposition 1, the set of states visited infinitely often by $f$ is an EC $U$ almost surely. By contradiction, assume that $U$ is not winning with some positive probability. Since $f$ is finite-memory, $\mathcal{G}[f]$ is finite. Since $U$ is visited with positive probability, there exists a reachable bottom strongly connected component $B$ in $\mathcal{G}[f]$ which projected to $\mathcal{G}$ is a subset of $U$. Since $U$ is not winning, there exists a play in $B$ with mean payoff $\not> \vec 0$. By prefix independence of the mean-payoff value function, there exists a play in $\mathcal{G}[f]$ with mean payoff $\not> \vec 0$, contradicting that $f$ is surely winning. □

Lemma 3. Let $\mathcal{G}$ be a pruned multidimensional mean-payoff MDP, let $s_0$ be a state in a WEC $W$ of $\mathcal{G}$, and let $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}|W}(s_0)$ with $\vec\nu > \vec 0$ be an expectation achievable by remaining inside $W$. There exists a randomized finite-memory strategy $h \in \Delta_F(\mathcal{G})$ s.t. $(\vec 0; \vec\nu) \in \mathrm{BWCSol}_{\mathcal{G}|W}(s_0, h)$ that also remains inside $W$.

Remark 9. The statement of the lemma holds even with $h$ a pure finite-memory strategy, by applying Remark 1 in the construction of strategy $g^{exp}_{\varepsilon/2}$ below.

The rest of this section is devoted to the proof of Lemma 3. To simplify the notation, we assume w.l.o.g. that the MDP G 
reduces to a single WEC W = S. Therefore, G = G \ ^■ In the following, let 

WCSolg(so, /) = {/I e I Vtt G ■ MP(7r) > fff , 

WCSolp(so)= U WCSolg(so,/) . 

/eA(e) 

Recall that is the set of plays originating from state sq which are consistent with strategy /. Since we are in a WEC, 
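$\vec 0 \in \mathrm{WCSol}_{\mathcal{G}}(s_0)$, and thus $\mathrm{WCSol}_{\mathcal{G}}(s_0)$ contains a vector $\vec\mu > \vec 0$. Moreover, since $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$ with $\vec\nu > \vec 0$, we can assume that $\vec\nu \in \mathrm{ExpSol}_{\mathcal{G}}(s_0)$ with $\vec\nu > \vec 0$. W.l.o.g. we further assume $\vec\mu < \vec\nu$. We show that, for every $\delta > 0$ and $\varepsilon > 0$, there exists a randomized finite-memory strategy $h^{cmb}_{K,L}$ satisfying the BWC threshold $> (\vec\mu - \delta; \vec\nu - \varepsilon)$. The construction of $h^{cmb}_{K,L}$ relies on the existence of the following two strategies:

• Since $\vec\mu$ is guaranteed in the worst case, by Lemma 2, for every $\delta > 0$ there exists a pure finite-memory strategy $f^{wc}_{\delta}$ s.t. $(\vec\mu - \delta) \in \mathrm{WCSol}_{\mathcal{G}}(s, f^{wc}_{\delta})$ for every state $s$ in the WEC.

• Similarly, since $\vec\nu$ is achievable in expectation from state $s_0$, by Lemma 1, for every $\varepsilon > 0$ there exists a randomized finite-memory strategy $g^{exp}_{\varepsilon}$ s.t. $(\vec\nu - \varepsilon) \in \mathrm{ExpSol}_{\mathcal{G}}(s, g^{exp}_{\varepsilon})$ for every state $s$ in the WEC.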

The strategy $h^{cmb}_{K,L}$ is parameterised by two natural numbers $K, L \in \mathbb{N}$, and it alternates between $g^{exp}_{\varepsilon/2}$ and $f^{wc}_{\delta/2}$ over periods of length $K$ and $L$, respectively:

(a) Play $g^{exp}_{\varepsilon/2}$ for $K > 0$ steps and record in $\mathrm{Sum} \in \mathbb{Z}^d$ the current sum of the weights since the beginning of the period.

(b) If $\mathrm{Sum} > (\vec\mu - \delta) \cdot K$, then go to (a). Otherwise, play $f^{wc}_{\delta/2}$ for $L > 0$ steps, and then go to (a).

Every time a new period starts, the memory of the relevant strategy is reset. It remains to determine the values of the parameters $K$ and $L$ to reach the desired accuracy. The parameter $K$, which depends on $\varepsilon$, controls the probability that a period is of type (a) or (a)+(b), and thus the quality of the approximation of the expectation objective: the larger the $K$, the higher the probability that a period is of type (a), and thus the closer the expectation to $\vec\nu$. The parameter $L$ (dependent on $K$ and $\delta$) controls the length of the recovery period (a)+(b), and thus the larger the $L$, the closer the worst case to $\vec\mu$.
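The alternation between the two kinds of periods can be sketched as a simple loop. This is an illustration under simplifying assumptions, not the construction itself: the two strategies are abstracted as hypothetical zero-argument callables `play_exp` and `play_wc` that return the weight vector of the next step, and the memory reset at the start of each period is left to the caller.

```python
def run_periods(play_exp, play_wc, K, L, threshold, periods):
    """Alternate K expectation steps with, when needed, L recovery steps.
    threshold: the vector (mu - delta), compared componentwise against Sum / K."""
    trace = []
    for _ in range(periods):
        weights = [play_exp() for _ in range(K)]       # step (a)
        total = [sum(col) for col in zip(*weights)]    # Sum since period start
        trace.extend(weights)
        on_target = all(t > m * K for t, m in zip(total, threshold))
        if not on_target:                              # step (b): recover
            trace.extend(play_wc() for _ in range(L))
    return trace
```

A period in which the recorded sum meets the threshold in every component has length `K`; otherwise it is extended by `L` recovery steps.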

We show that we can always choose $L$ s.t. the worst-case objective is satisfied.

Lemma 19. For every $\delta, \varepsilon > 0$ and $K \in \mathbb{N}$, there exists $L \in \mathbb{N}$ s.t. $L$ = O(^) and $(\vec\mu - \delta) \in \mathrm{WCSol}_{\mathcal{G}}(s, h^{cmb}_{K,L})$ for every state $s$ in the EC.

Proof. Let $m$ be the product of the size of the memory of $f^{wc}_{\delta/2}$ and the number of states in $\mathcal{G}$, and let $\mu^* > 0$ be the smallest component of $\vec\mu$. W.l.o.g. we assume $\delta < \mu^*$, since $\mathrm{WCSol}_{\mathcal{G}}(s, h^{cmb}_{K,L})$ is downward-closed. Below, we derive an expression for $L$ for the worst-case objective $> \vec\mu - \delta$ to be satisfied. Let $\pi$ be any $h^{cmb}_{K,L}$-consistent play. We decompose $\pi = \rho_0 \rho_1 \cdots$ according to periods (a) or (a)+(b). If $\rho_i$ has type (a), then $\mathrm{TP}(\rho_i) > (\vec\mu - \delta) \cdot K$ directly from the definition of periods of type (a). If $\rho_i$ has type (a)+(b), then at the end of the (a) part (of length $K$) the sum of weights is at least $-K \cdot W$ in every component. (Recall that $W$ is the largest absolute value of any weight in $\mathcal{G}$.) Moreover, during the following (b) part (of length $L$), we have that every time the same memory state of $f^{wc}_{\delta/2}$ and state of $\mathcal{G}$ repeats, the mean payoff is at least $\vec\mu - \frac{\delta}{2}$, thus yielding a sum of weights which is at least $-m \cdot W + (L - m) \cdot (\mu^* - \frac{\delta}{2})$ in every component. Thus, for every component $1 \leq j \leq d$, we have

$$\mathrm{TP}(\rho_i)[j] \geq -K \cdot W - m \cdot W + (L - m) \cdot \Big(\mu^* - \frac{\delta}{2}\Big).$$

In order to have $\mathrm{MP}(\pi) > \vec\mu - \delta$, it suffices to have $\mathrm{MP}(\rho_i) > \vec\mu - \delta$ for each period $i$ since periods are of uniformly bounded length, and thus $\mathrm{TP}(\rho_i) > (\vec\mu - \delta) \cdot (K + L)$, since this length is at most $K + L$. It is easy to see that the following choice for $L$ satisfies the constraint above:

$$L := \Big\lceil \frac{2 \cdot K \cdot (W + \mu^* - \delta) + m \cdot (2 \cdot W + 2 \cdot \mu^* - \delta)}{\delta} \Big\rceil \qquad (7)$$

□
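The choice of $L$ in Eq. 7 can be evaluated directly. The sketch below is only an illustration; the function name `recovery_length` and the sample values are hypothetical, and all quantities are treated as scalars, as in the per-component bound of the proof.

```python
import math

def recovery_length(K, W, mu_star, delta, m):
    """Length L of the recovery part (b), following Eq. 7."""
    numerator = 2 * K * (W + mu_star - delta) + m * (2 * W + 2 * mu_star - delta)
    return math.ceil(numerator / delta)
```

With this `L`, the per-component lower bound on the total payoff of a period of type (a)+(b) reaches `(mu_star - delta) * (K + L)`; note that `L` grows linearly with `K`.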



In the following, we consider $L$ as fixed by Eq. 7. We show that one can always choose $K$ s.t. the expectation objective is satisfied. We crucially use the fact that $L$ is linear in $K$, and that the probability of periods of type (a)+(b) can be made negligible for large $K$.

Lemma 20. For sufficiently small $\delta, \varepsilon > 0$, there exists $K \in \mathbb{N}$ s.t. $K$ = O(-), and $(\vec\nu - \varepsilon) \in \mathrm{ExpSol}_{\mathcal{G}}(s, h^{cmb}_{K,L})$ for every state $s$ in the EC.


Proof. Let $E(K, s)$ be the expected mean payoff vector in the MDP $\mathcal{G}$ when Controller is playing according to $h^{cmb}_{K,L}$ starting from state $s$. We prove that there exists a $K$ = O(^) s.t.

$$E(K, s) \geq \vec\nu - \varepsilon.$$

Let $E_{(a)}(K, s)$ and $E_{(a)+(b)}(K, s)$ be the expected mean payoff of periods of type (a) and (a)+(b), respectively. Let $p(K)$ be the probability of having a period of type (a)+(b). The expected length of a period is $(1 - p(K)) \cdot K + p(K) \cdot (K + L)$. Similarly, the expected total payoff of a period is $(1 - p(K)) \cdot E_{(a)}(K, s) \cdot K + p(K) \cdot E_{(a)+(b)}(K, s) \cdot (K + L)$. We thus obtain the following expression on the right for the expected mean payoff over one period, where the equality to the expected mean payoff over the entire play is easy to show:

$$E(K, s) = \frac{(1 - p(K)) \cdot E_{(a)}(K, s) \cdot K + p(K) \cdot E_{(a)+(b)}(K, s) \cdot (K + L)}{(1 - p(K)) \cdot K + p(K) \cdot (K + L)}$$

By dividing numerator and denominator by $(1 - p(K)) \cdot K$, we obtain the following expression:

$$E(K, s) = \frac{E_{(a)}(K, s) + \frac{p(K)}{1 - p(K)} \cdot E_{(a)+(b)}(K, s) \cdot \frac{K + L}{K}}{1 + \frac{p(K)}{1 - p(K)} \cdot \frac{K + L}{K}}$$

Since $E_{(a)+(b)}(K, s) > \vec\mu - \delta$ by the choice of $L$ (cf. the proof of Lemma 19), and $\vec\mu - \delta > 0$ for sufficiently small $\delta$, it suffices to find $K$ s.t.

$$\frac{E_{(a)}(K, s)}{1 + \frac{p(K)}{1 - p(K)} \cdot \frac{K + L}{K}} \geq \vec\nu - \varepsilon \qquad (8)$$

From a qualitative point of view, when taking the limit for $K \to \infty$ in Eq. 8,

• $\frac{K + L}{K}$ tends to a constant, since $L$ is linear in $K$, and

• $p(K)$ tends to 0.

Thus, $\lim_{K \to \infty} E(K, s) = \lim_{K \to \infty} E_{(a)}(K, s) > \vec\nu - \frac{\varepsilon}{2}$, where the inequality follows from the fact that in periods of type (a) we are playing according to $g^{exp}_{\varepsilon/2}$, which guarantees expectation $> \vec\nu - \frac{\varepsilon}{2}$ in the long run. Consequently, for every $\varepsilon > 0$, there exists a $K$ s.t. $E(K, s) \geq \vec\nu - \varepsilon$, as required.

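The rest of the proof shows that $K$ can be taken to be O(^). We proceed by a Hoeffding-style analysis. The following assumption can be satisfied for sufficiently small $\delta, \varepsilon > 0$, since $\vec\mu < \vec\nu$ was assumed throughout this section.

Assumption 3. $\vec\nu - \varepsilon > \vec\mu - \delta > 0$.

Lemma 18 provides a constant $K_0$ and an upper bound $\mathfrak{G}(K, \frac{\varepsilon}{4})$ on the probability that the mean payoff deviates from $\vec\nu$ by more than $\frac{\varepsilon}{2}$ in any component when playing according to $g^{exp}_{\varepsilon/2}$ for every $K \geq K_0$. Thus, $1 - \mathfrak{G}(K, \frac{\varepsilon}{4})$ is a lower bound on the probability that the mean payoff in any component deviates from $\vec\nu$ by less than $\frac{\varepsilon}{2}$.

Assumption 4. Let $K \geq K_0$.

We get the following lower bound on the expected mean payoff $E_{(a)}(K, s)$ of a period of type (a):

$$E_{(a)}(K, s) \geq \Big(1 - \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big)\Big) \cdot \Big(\vec\nu - \frac{\varepsilon}{2}\Big) + \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \cdot \vec\nu_0$$

where $\vec\nu_0 > 0$ is a lower bound on the mean payoff of any period of type (a). Therefore, we obtain the following simpler bound:

$$E_{(a)}(K, s) \geq \Big(1 - \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big)\Big) \cdot \Big(\vec\nu - \frac{\varepsilon}{2}\Big) \qquad (9)$$

Moreover, we can use Lemma 18 also to provide a bound on $p(K)$, the probability of a period of type (a)+(b). If a period is of type (a)+(b), then there exists a component $1 \leq j \leq d$ s.t. $\mathrm{TP}_K[j] < 0$, and thus $\mathrm{MP}_K[j] < 0$. Since the strategy $g^{exp}_{\varepsilon/2}$ achieves expectation $> \vec\nu - \frac{\varepsilon}{2} > 0$, we have $|(\mathrm{MP}_K - \vec\nu)[j]| > \frac{\varepsilon}{2}$ for some $1 \leq j \leq d$. The latter event happens with probability at most $\mathfrak{G}(K, \frac{\varepsilon}{4})$ by Lemma 18. We thus obtain the following bound:

$$p(K) \leq \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \qquad (10)$$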

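By using Eq. 9 and 10 in Eq. 8, it suffices to find a $K$ s.t.

$$\frac{(1 - \mathfrak{G}(K, \frac{\varepsilon}{4})) \cdot (\vec\nu - \frac{\varepsilon}{2})}{1 + \frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \cdot \frac{K + L}{K}} \geq \vec\nu - \varepsilon.$$

Let $\gamma$ be an upper bound on $\frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \cdot \frac{K + L}{K}$. Then, it suffices to show

$$
\begin{aligned}
& \forall (1 \leq j \leq d) \cdot \frac{(1 - \mathfrak{G}(K, \frac{\varepsilon}{4})) \cdot (\vec\nu[j] - \frac{\varepsilon}{2})}{1 + \gamma} \geq \vec\nu[j] - \varepsilon \\
\text{iff}\quad & \forall (1 \leq j \leq d) \cdot 1 - \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \geq \frac{(\vec\nu[j] - \varepsilon)(1 + \gamma)}{\vec\nu[j] - \frac{\varepsilon}{2}} \\
\text{iff}\quad & \forall (1 \leq j \leq d) \cdot 1 - \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \geq \frac{\vec\nu[j] - \varepsilon + \gamma \cdot (\vec\nu[j] - \varepsilon)}{\vec\nu[j] - \frac{\varepsilon}{2}} \\
\text{iff}\quad & \forall (1 \leq j \leq d) \cdot \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \leq \frac{\frac{\varepsilon}{2} - \gamma \cdot (\vec\nu[j] - \varepsilon)}{\vec\nu[j] - \frac{\varepsilon}{2}} \qquad (11)
\end{aligned}
$$

Let $\nu^* = \max\{\vec\nu[1], \ldots, \vec\nu[d]\}$ be the maximal component of $\vec\nu$. We show that with the following value of $\gamma$ we can satisfy all objectives.

Assumption 5. $\gamma := \frac{1}{4} \cdot \frac{\varepsilon}{\nu^* - \varepsilon}$.

We now plug in the value for $\gamma$ into Eq. 11 (using $\vec\nu[j] \leq \nu^*$), yielding the following bound on $\mathfrak{G}(K, \varepsilon/4)$:

$$\mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \leq \frac{1}{2} \cdot \frac{\varepsilon}{2 \nu^* - \varepsilon} \qquad (12)$$

Before solving the inequality above for $K$, we go back one step and discuss the constraint derived from the definition of $\gamma$. Recall that $\gamma$ should satisfy $\frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \cdot \frac{K + L}{K} \leq \gamma$. By replacing $\gamma$ with its definition $\gamma := \frac{1}{4} \cdot \frac{\varepsilon}{\nu^* - \varepsilon}$ in the latter inequality, we obtain the following inequality:

$$\frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \cdot \frac{K + L}{K} \leq \frac{1}{4} \cdot \frac{\varepsilon}{\nu^* - \varepsilon} \qquad (13)$$

By the definition, $L = \big\lceil \frac{2 \cdot K \cdot (W + \mu^* - \delta) + m \cdot (2 \cdot W + 2 \cdot \mu^* - \delta)}{\delta} \big\rceil$, and thus $L \leq \frac{2 \cdot K \cdot (W + \mu^*) + m \cdot (2 \cdot W + 2 \cdot \mu^*)}{\delta} + 1 = \frac{2 \cdot (W + \mu^*) \cdot (K + m)}{\delta} + 1$. Thus, $\frac{K + L}{K} \leq 1 + \frac{2(W + \mu^*)}{\delta} + \frac{1}{K} \cdot \frac{2m(W + \mu^*) + \delta}{\delta}$. In particular, for $K \geq \frac{2m(W + \mu^*) + \delta}{\delta}$ we have $\frac{K + L}{K} \leq 2 + \frac{2(W + \mu^*)}{\delta}$.

Assumption 6. $K \geq K_1 := \max\big\{K_0, \frac{2m(W + \mu^*) + \delta}{\delta}\big\}$.

Thus, for $K \geq K_1$, it suffices to satisfy the following inequality:

$$\frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \cdot \frac{2}{\delta} \cdot (W + \mu^* + \delta) \leq \frac{1}{4} \cdot \frac{\varepsilon}{\nu^* - \varepsilon} \qquad (14)$$

We perform the following algebraic manipulations: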


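$$
\begin{aligned}
& \frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \cdot \frac{2}{\delta} \cdot (W + \mu^* + \delta) \leq \frac{1}{4} \cdot \frac{\varepsilon}{\nu^* - \varepsilon} \\
\text{iff}\quad & \frac{\mathfrak{G}(K, \frac{\varepsilon}{4})}{1 - \mathfrak{G}(K, \frac{\varepsilon}{4})} \leq \frac{1}{8} \cdot \frac{\delta}{W + \mu^* + \delta} \cdot \frac{\varepsilon}{\nu^* - \varepsilon} \\
\text{iff}\quad & \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \leq \Big(1 - \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big)\Big) \cdot \frac{\delta \varepsilon}{8 (W + \mu^* + \delta)(\nu^* - \varepsilon)} \\
\text{iff}\quad & \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \cdot \big(8 (W + \mu^* + \delta)(\nu^* - \varepsilon) + \delta \varepsilon\big) \leq \delta \varepsilon \\
\text{iff}\quad & \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \leq \frac{\delta \varepsilon}{8 (W + \mu^* + \delta)(\nu^* - \varepsilon) + \delta \varepsilon} \\
\text{iff}\quad & \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \leq \frac{\delta \varepsilon}{(8 \mu^* + 8 W + 8 \delta) \cdot \nu^* - (8 \mu^* + 8 W + 7 \delta) \cdot \varepsilon} \\
\text{iff}\quad & \mathfrak{G}\Big(K, \frac{\varepsilon}{4}\Big) \leq \frac{\varepsilon}{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^* - \big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\big) \cdot \varepsilon} \qquad (15)
\end{aligned}
$$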


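We now compare the bounds on $\mathfrak{G}(K, \frac{\varepsilon}{4})$ given by Eq. 12 and 15, and we show that, for sufficiently small $\varepsilon$, the latter bound implies the former. Therefore, we seek $\varepsilon$ s.t.

$$
\begin{aligned}
& \frac{\varepsilon}{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^* - \big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\big) \cdot \varepsilon} \leq \frac{1}{2} \cdot \frac{\varepsilon}{2 \nu^* - \varepsilon} \\
\text{iff}\quad & \Big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\Big) \cdot \nu^* - \Big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\Big) \cdot \varepsilon \geq 4 \cdot \nu^* - 2 \cdot \varepsilon \\
\text{iff}\quad & \Big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 4\Big) \cdot \nu^* \geq \Big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 5\Big) \cdot \varepsilon \\
\text{iff}\quad & \varepsilon \leq \frac{8 (\mu^* + W) + 4 \delta}{8 (\mu^* + W) + 5 \delta} \cdot \nu^*
\end{aligned}
$$

Assumption 7. $\varepsilon \leq \varepsilon_0 := \frac{8 (\mu^* + W) + 4 \delta}{8 (\mu^* + W) + 5 \delta} \cdot \nu^*$.

Therefore, for sufficiently small $\varepsilon \leq \varepsilon_0 < \nu^*$ it suffices to solve Eq. 15. Recall that $\mathfrak{G}(K, \frac{\varepsilon}{4}) = a \cdot 2^d \cdot e^{-b \cdot K \cdot \varepsilon^2/16}$. Thus, we seek $K$ s.t.

$$
\begin{aligned}
& a \cdot 2^d \cdot e^{-b \cdot K \cdot \varepsilon^2/16} \leq \frac{\varepsilon}{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^* - \big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\big) \cdot \varepsilon} \\
\text{iff}\quad & e^{-b \cdot K \cdot \varepsilon^2/16} \leq \frac{1}{a \cdot 2^d} \cdot \frac{\varepsilon}{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^* - \big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\big) \cdot \varepsilon} \\
\text{iff}\quad & -b \cdot K \cdot \varepsilon^2/16 \leq \ln\Big(\frac{1}{a \cdot 2^d} \cdot \frac{\varepsilon}{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^* - \big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\big) \cdot \varepsilon}\Big) \\
\text{iff}\quad & K \geq \frac{16}{b \cdot \varepsilon^2} \cdot \ln\Big(a \cdot 2^d \cdot \frac{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^* - \big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 7\big) \cdot \varepsilon}{\varepsilon}\Big) \\
\text{if}\quad & K \geq \frac{16}{b \cdot \varepsilon^2} \cdot \ln\Big(a \cdot 2^d \cdot \frac{\big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\big) \cdot \nu^*}{\varepsilon}\Big) \\
\text{iff}\quad & K \geq \frac{16}{b \cdot \varepsilon^2} \cdot \Big(\ln a + d \ln 2 + \ln\Big(\frac{8 \mu^*}{\delta} + \frac{8 W}{\delta} + 8\Big) + \ln \nu^* - \ln \varepsilon\Big) =: K_2 \qquad (16)
\end{aligned}
$$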


The constant K 2 is of the form -%, with a linear in e and d (recall that a is exponential in e), and polynomial in the 


characteristics of the Markov chain. Thus, K = 0{^). 


□ 
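The thresholds obtained in this proof can be evaluated numerically. The following is a minimal sketch, assuming the deviation bound has the shape $a \cdot 2^d \cdot e^{-b K \varepsilon^2/16}$ at accuracy $\varepsilon/4$; the function names and sample constants are hypothetical.

```python
import math

def eps0(mu_star, W, delta, nu_star):
    """Largest eps allowed by Assumption 7."""
    s = 8 * (mu_star + W)
    return (s + 4 * delta) / (s + 5 * delta) * nu_star

def k2(a, b, d, eps, delta, mu_star, W, nu_star):
    """Period length K_2 of Eq. 16."""
    x = 8 * mu_star / delta + 8 * W / delta + 8
    return 16 / (b * eps ** 2) * (math.log(a) + d * math.log(2)
                                  + math.log(x) + math.log(nu_star)
                                  - math.log(eps))

def g_quarter(a, b, d, K, eps):
    """Assumed bound at accuracy eps/4: a * 2^d * exp(-b * K * eps^2 / 16)."""
    return a * 2 ** d * math.exp(-b * K * eps ** 2 / 16)
```

For `eps` below `eps0` and `K` at least `k2`, the value of `g_quarter` is below the right-hand side of Eq. 15, which in turn is below the right-hand side of Eq. 12.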


B. Infinite-memory synthesis 

1) Inside an EC: We complete the proof of Lemma 6 by showing the two claims. The first claim relies on Lemma 3.9 in [19], while the second claim relies on Lemma 18.

Claim 1. There are rational constants a and b with b < 1 s.t., for every i >1 and for sufficiently large K, pi^K £ « • 

Note that $a$ and $b$ depend neither on $K$ nor on $i$.

Proof. Recall that $p_{i,K}$ is, by definition, the probability that the total payoff goes below $N_i$ in any component during phase $i$. Since at the beginning of phase $i$ the total payoff is $TP_{K \cdot i} \ge 2 \cdot N_i$ by definition, $p_{i,K}$ is upper bounded by the probability that the total payoff decreases by at least $N_i$ in any component:

\[ p_{i,K} \le \mathbb{P}\left[\exists (K \cdot i \le h < K \cdot (i+1)) \cdot TP_h - TP_{K \cdot i} \not\ge -N_i\right] \]

For any state $s$ in the EC, and payoff threshold $N \in \mathbb{N}$, let $p_{s,N,K}$ be the probability that the total payoff goes below $-N$ in any component when starting from 0 within at most $K$ steps:

\[ p_{s,N,K} = \mathbb{P}_s\left[\exists (0 \le h \le K) \cdot \exists (1 \le j \le d) \cdot TP_h[j] < -N\right] \]

Thus, $p_{i,K} \le \max_{s \in S} p_{s, N_i^{\min}, K}$. We prove the claim thanks to the following lemma. (It follows from Lemma 3.9 in [19]. For completeness, we provide a proof below.)

Lemma 21. There exists a rational constant $c < 1$ and an integer $N^* \ge 0$ s.t., for every $N \ge N^*$ and starting state $s$ in $G[f_K]$ s.t. $\mathbb{P}_s^{G[f_K]}[MP > 0] = 1$,

\[ p_{s,N,K} \le 2^d \cdot \frac{c^N}{1 - c} \]


By Lemma 1 we can assume that when playing according to $f^{exp}$ we obtain a Markov chain $G[f^{exp}]$ which is unichain. Since $f^{exp}$ has strictly positive expected mean payoff and $G[f^{exp}]$ is unichain, the mean payoff is almost surely strictly positive, and thus we can apply Lemma 21 to the Markov chain $G[f^{exp}]$ and obtain constants $N^*$ and $c$ s.t. for every $N \ge N^*$, $K \ge 0$, and state $s$, $p_{s,N,K} \le 2^d \cdot \frac{c^N}{1 - c}$. Let $\nu^* = \min_{1 \le j \le d} \vec{\nu}[j]$, and let $K^*$ be large enough s.t., for any $K \ge K^*$, $N_1^{\min} \ge N^*$. (For example, take $K^* = \frac{2 N^*}{\nu^*}$.) Assume we are in any phase $i \ge 1$. Since $N_i^{\min} \ge N^*$ by the choice of $K^*$, by Lemma 21 we obtain, for every $K \ge K^*$, $p_{s, N_i^{\min}, K} \le 2^d \cdot \frac{c^{N_i^{\min}}}{1 - c}$. Since $p_{i,K} \le \max_{s \in S} p_{s, N_i^{\min}, K}$, we have $p_{i,K} \le 2^d \cdot \frac{c^{N_i^{\min}}}{1 - c}$.

The claim follows by taking $a = 2^d/(1 - c)$ and $b = c^{\nu^*/2}$, which is $< 1$ since $c$ is. □

Proof of Lemma 21. We first prove the lemma in the unidimensional case. Let $T$ be the set of states from where the total payoff is almost surely $-\infty$, and let $T^*$ be the set of states which can reach $T$ with positive probability. Lemma 3.9 in [19], when applied to the Markov chain $G[f_K]$ (in general, it applies to Markov decision processes), guarantees that there exists a rational constant $c < 1$ and an integer $N^* \ge 0$ s.t. for every $N \ge N^*$

\[ \mathbb{P}_s^{G[f_K]}\left[\exists i \in \mathbb{N} \cdot TP_i < -N \text{ and } T^* \text{ is not visited}\right] \le \frac{c^N}{1 - c} \]

Since we assume $\mathbb{P}_s^{G[f_K]}[MP > 0] = 1$, by Lemma 12 the probability to have a total payoff equal to $-\infty$ is 0, and thus $T$ is empty. Therefore, the probability on the left above is just $\mathbb{P}_s^{G[f_K]}[\exists i \in \mathbb{N} \cdot TP_i < -N]$, which is the definition of $p_{s,N,K}$, and thus

\[ p_{s,N,K} \le \frac{c^N}{1 - c} \]

Now let $d > 1$, and, for a component $1 \le j \le d$, let $p^j_{s,N,K}$ be the probability that we have a drop by $-N$ in the total payoff in component $j$, i.e.,

\[ p^j_{s,N,K} = \mathbb{P}_s^{G[f_K]}\left[\exists i \in \mathbb{N} \cdot TP_i[j] < -N\right] \]

By looking at the Markov chain $G[f_K]$ projected to component $j$, we can apply the result from the first part and obtain that, for every fixed $j$, there exist $c_j < 1$ and $N_j^* \ge 0$ s.t., for every $N \ge N_j^*$,

\[ p^j_{s,N,K} \le \frac{c_j^N}{1 - c_j} \]

By taking $N^* = \max_{1 \le j \le d} N_j^*$ and letting $c$ be the $c_j$ maximizing $\frac{c_j^N}{1 - c_j}$ (i.e., the largest $c_j$), we have, for every $N \ge N^*$,

\[ p^j_{s,N,K} \le \frac{c^N}{1 - c} \]

By definition, $1 - p^j_{s,N,K}$ is the probability that the total payoff never goes below $-N$ in component $j$, and thus $\left(1 - \frac{c^N}{1 - c}\right)^d$ is a lower bound on the probability that the total payoff never goes below $-N$ in any component. Therefore $p_{s,N,K} \le 1 - \prod_{1 \le j \le d} (1 - p^j_{s,N,K}) \le 1 - \left(1 - \frac{c^N}{1 - c}\right)^d$. By a simple calculation (cf. the proof of Lemma 9), we derive $\left(1 - \frac{c^N}{1 - c}\right)^d \ge 1 - 2^d \cdot \frac{c^N}{1 - c}$, and thus

\[ p_{s,N,K} \le 1 - \left(1 - 2^d \cdot \frac{c^N}{1 - c}\right) = 2^d \cdot \frac{c^N}{1 - c} \]

as required. □
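The "simple calculation" used in the last step, $(1 - x)^d \ge 1 - 2^d x$ for $0 \le x \le 1$, already follows from Bernoulli's inequality since $d \le 2^d$. The following Python sketch checks it on a grid and shows how quickly the resulting bound $2^d \cdot c^N/(1 - c)$ decays in $N$. The function names and the values of $c$, $d$ and the target probability are hypothetical, chosen only for illustration.

```python
def single_component_bound(c, N):
    """Unidimensional bound c^N / (1 - c) on the drop probability."""
    return c ** N / (1 - c)


def all_components_bound(c, N, d):
    """Multidimensional bound 2^d * c^N / (1 - c) from the lemma."""
    return 2 ** d * single_component_bound(c, N)


def calculation_gap(x, d):
    """(1 - x)^d - (1 - 2^d * x); nonnegative for 0 <= x <= 1."""
    return (1 - x) ** d - (1 - 2 ** d * x)


def smallest_N(c, d, target):
    """Smallest N for which the multidimensional bound is at most target."""
    N = 0
    while all_components_bound(c, N, d) > target:
        N += 1
    return N


# Grid check of the inequality (small tolerance for floating-point rounding).
grid_ok = all(calculation_gap(k / 100, d) >= -1e-12
              for d in range(1, 7) for k in range(101))
print("inequality holds on the grid:", grid_ok)

# Hypothetical constants: c = 0.5, d = 2, target probability 0.01.
print("smallest N:", smallest_N(0.5, 2, 0.01))
```

Since the bound decays geometrically in $N$, the threshold needed to reach a target probability grows only logarithmically in the inverse of that target.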

Claim 2. There exist rational constants $a$ and $b$ with $b < 1$ s.t., for every $i \ge 0$ and sufficiently large $K$, $q_{i,K} \le a \cdot b^{K(i+1)}$.
Note that $a$ and $b$ depend neither on $K$ nor on $i$.

Proof. Recall the definition of $q_{i,K} = \mathbb{P}\left[TP_{K \cdot (i+1)} \not\ge 2 \cdot N_{i+1}\right]$. By the definition of $N_{i+1} = \frac{K \cdot (i+1)}{2} \cdot \vec{\nu}$,

\[ q_{i,K} = \mathbb{P}\left[MP_{K \cdot (i+1)} \not\ge \vec{\nu}\right] \]

Let $\vec{\nu}^*$ be the expected mean payoff of $f^{exp}$ and let $\delta = \min_{1 \le j \le d} (\vec{\nu}^*[j] - \vec{\nu}[j]) > 0$. Then, $q_{i,K}$ is upper bounded by the probability that the mean payoff deviates from its expected value $\vec{\nu}^*$ by more than $\delta$ in any component. By Lemma 18, this is upper bounded by $a' \cdot 2^d \cdot e^{-b' (i+1) K \delta^2}$ for sufficiently large $(i+1)K$, and thus for sufficiently large $K$. Take $a = a' \cdot 2^d$ and $b = e^{-b' \delta^2} < 1$. □
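Bounds that decay geometrically in the phase index are summable over all phases, and the sum itself vanishes as $K$ grows. The following Python sketch illustrates this for a bound of the assumed shape $a \cdot b^{K \cdot i}$, for which $\sum_{i \ge 1} a \cdot b^{K \cdot i} = a \cdot b^K / (1 - b^K)$. The shape of the bound, the function names and the constants are assumptions made for illustration; how the two claims are combined in the proof of Lemma 6 is not shown here.

```python
def phase_bound(a, b, K, i):
    """Assumed per-phase bound a * b^(K*i)."""
    return a * b ** (K * i)


def sum_over_phases(a, b, K):
    """Closed form of sum_{i >= 1} a * b^(K*i) (geometric series)."""
    r = b ** K
    return a * r / (1 - r)


# Hypothetical constants with b < 1.
a, b = 8.0, 0.9

# The partial sums converge to the closed form.
partial = sum(phase_bound(a, b, 10, i) for i in range(1, 200))
print(abs(partial - sum_over_phases(a, b, 10)) < 1e-9)

# The total over all phases shrinks as K grows.
for K in (10, 50, 100):
    print(K, sum_over_phases(a, b, K))
```

For small $K$ the sum may exceed 1, in which case it is vacuous as a probability bound; it becomes informative only once $K$ is large enough.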


C. Inside a general MDP

In this section, we prove the correctness of the reduction of the BWC problem to solving system T'. 

Lemma 7. Let $G$ be a pruned multidimensional mean-payoff MDP, let $\vec{\nu} > \vec{0}$, and let $s_0 \in S$. There exists a (possibly infinite-memory) strategy $h$ s.t. $(\vec{0}; \vec{\nu}) \in BWCSol_G(s_0, h)$, if and only if the system $T'$ has a non-negative solution.

The following proposition is the analogue of Proposition 3 where we consider MECs (instead of MWECs) and arbitrary strategies (instead of finite-memory ones). As before, it rests on the assumption that all states of $G$ are reachable from $s_0$.

Proposition 6. Let $G$ be a pruned multidimensional mean-payoff MDP. If there exists a strategy $h$ s.t. $(\vec{0}; \vec{\nu}) \in BWCSol_G(s_0, h)$, then there exists a strategy $h^*$ with the same property, and such that, for every MEC $U$, the set of states visited infinitely often by $h^*$ is a subset of $U$ with positive probability.

The following proposition (and its proof) is analogous to Proposition 4, where MWECs are replaced by MECs. 
Proposition 7. If $T'$ has a non-negative solution, then there exists a (finite-memory) strategy $h$ s.t.

1) $\vec{\nu} \in ExpSol_G(s_0, h)$.

2) For every MEC $U$, there is a probability $y^*_U > 0$ s.t. the set of states visited infinitely often by $h$ is inside $U$ with probability $y^*_U$.

3) The set of states visited infinitely often by $h$ is almost surely an EC. Consequently, $\sum_{\text{MEC } U} y^*_U = 1$.

4) Once inside a MEC $U$, $h$ achieves expected mean payoff $\vec{\nu}_U > \vec{0}$.

5) $\sum_{\text{MEC } U} y^*_U \cdot \vec{\nu}_U > \vec{\nu}$.

Proof of Lemma 7. The proof is similar to the finite-memory case in Lemma 4. We sketch here the crucial differences.

For one direction, let $h$ be a strategy s.t. $(\vec{0}; \vec{\nu}) \in BWCSol_G(s_0, h)$. By Proposition 6, there exists a strategy $h^*$ that additionally visits each MEC infinitely often with positive probability. By the construction in Proposition 4.4 of [9] applied to strategy $h^*$, we obtain a solution to system $T'$. Equations (A1), (A1'), (A2), (C1), (C1'), and (C2) are shown to be satisfied in the proof of Proposition 4.4. Eq. (B-bis) is satisfied since, by Proposition 1, $h^*$ is eventually trapped in an EC almost surely (not necessarily a WEC). Finally, Eq. (C3-bis) is satisfied: By construction, $h^*$ visits each MEC infinitely often with positive probability. For every MWEC $U$ there exist $s, t \in U$ s.t. > 0. Since $h^*$ is winning for the worst-case, it achieves an expected mean payoff $> 0$ in $U$, and thus Eq. (C3-bis) is satisfied.

For the other direction, we use Proposition 7 to obtain strategy $h$, and then we proceed as in the second part of the proof of Lemma 4 by replacing MWEC with MEC, and by using Lemma 5 instead of Lemma 3 in the construction of strategies $h_U$'s. □